Question

I have a large data set and I'd like to create 3 randomly selected subsets of it, each of size 50. I only need the values from one specific column (the 13th).

This must be easy to do in R. How should I go about it?


Solution

replicate(3, sample(200, 50))

Here 200 is the number of rows in the data frame (adjust accordingly). More automagically, assuming the data are in an object named df:

replicate(3, sample(nrow(df), 50))

Here is an example:

set.seed(10)
df <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))

ind <- replicate(3, sample(nrow(df), 50))
head(ind)

> head(ind)
     [,1] [,2] [,3]
[1,]  380  220  702
[2,]   75  751  720
[3,]  775  278  153
[4,]  988  612  340
[5,]  282  568  925
[6,]  266  794  812

The columns contain the 3 subsets you want. You could then use this to index the original data frame, e.g.

df[ind[,1], "x2"]

> df[ind[,1], "x2"]
 [1]  0.57982435  0.27016645 -0.08435526  1.16768142  1.38124150  0.62444167
 [7] -0.54887437  1.91301831  1.84116197  0.94045377 -1.15417235 -0.06809104
[13] -2.03652525  1.06773801 -0.34235315 -0.24707548 -1.80470122  0.11993674
[19] -0.36358182  0.16819156 -1.84507669 -0.16707925 -1.80789383  0.78894210
[25] -0.05741295 -0.28905260  2.38724835  2.75762831 -0.18082554  1.61820620
[31] -0.48192569 -0.03298339  0.52087746  0.32774925  1.52103207 -0.15619668
[37] -0.49687983 -0.06623606  2.21855213 -0.48727519  1.01115806  0.25213485
[43]  1.01927105  0.31362619  0.40260968  0.26795767  0.01803656  0.19579576
[49] -0.26464131  0.48141105

Here I take the first subset and only the variable x2.
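Since the question asks for the 13th column, the same idea can be applied by column position, and all three subsets can be extracted in one step. The following is only a sketch: the 13-column data frame `dat` is made up for illustration and stands in for the real data set.

```r
set.seed(10)
# Hypothetical data: 1000 rows, 13 numeric columns (a stand-in for the real data set)
dat <- as.data.frame(matrix(rnorm(1000 * 13), ncol = 13))

# One column of row indices per subset, as in the example above
ind <- replicate(3, sample(nrow(dat), 50))

# Pull column 13 for each subset; the result is a 50 x 3 matrix of values
subsets <- apply(ind, 2, function(i) dat[i, 13])
dim(subsets)
```

Each column of `subsets` then holds the 50 sampled values of column 13 for one subset.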

Note that this samples without replacement; in other words, each row of df can occur at most once in a given subset. If you want rows to be able to occur multiple times within a subset, see the replace argument in ?sample.
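For completeness, a sketch of the with-replacement variant. It reuses the toy data frame from the example; `ind_rep` is a name introduced here for illustration.

```r
set.seed(10)
df <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))

# Each column is a subset of 50 row indices; with replace = TRUE
# a row may be drawn more than once within the same subset
ind_rep <- replicate(3, sample(nrow(df), 50, replace = TRUE))

# Number of repeated draws in each subset (0 means no repeats happened to occur)
apply(ind_rep, 2, function(i) sum(duplicated(i)))
```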

OTHER TIPS

@Gavin's solution is good, but the subsets it generates may overlap: the same row can appear in more than one subset. My solution guarantees that each row appears in at most one subset.

k <- 3
# Draw 50*k distinct row indices in one go, then cut them into k chunks of 50
x <- sample(nrow(df), 50*k, replace = FALSE)
split(x, ceiling(seq_along(x)/50))

$`1`
 [1] 595 392 370 504 494 167 633 264 648 465 757 566 914 406 104 486 965 360 426 724 442 583
[23] 252 732 588 513  76 514 142 843 923 806 540 470 128 356  20 391 117 879 185 977 849 820
[45] 174 170 157 737 692 308

$`2`
 [1]  48 207   7 415 850 777 525  85 389 440 503 459 718 455 865 108 453 810 864 608 567 184
[23] 731 954 575 579 784 795 435 898 106  53 450 841 916 768  26 919 860 502 858 481 225 303
[45] 272 646  49 422 803 320

$`3`
 [1] 596 447 516 789 948 893 218 838 100 493 958 410 353 982  93 581 188 822 660 230 696 891
[23] 892 368 161 786  50 326 984 944 478 483 690 776 642 522 203 475 325 449 305 134 463 582
[45] 432 548 759   1 578 825
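
The list returned by split() can be used to index the data frame in the same way as the columns of ind. A minimal sketch, assuming the toy df from the accepted answer; the names `groups` and `vals` are introduced here for illustration only.

```r
set.seed(10)
df <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))

k <- 3
x <- sample(nrow(df), 50 * k, replace = FALSE)
groups <- split(x, ceiling(seq_along(x) / 50))

# Values of x2 for each disjoint subset, as a list of k numeric vectors
vals <- lapply(groups, function(i) df[i, "x2"])

# Check that no row index appears in more than one subset
anyDuplicated(unlist(groups)) == 0
```

Note that this approach needs at least 50*k rows, since all indices are drawn without replacement in a single call.
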
Licensed under: CC-BY-SA with attribution