Question

I have a longitudinal dataset with one row per observation for every individual. Each observation has several measurements, some of which may be missing. The number of observations per individual is highly variable, and there are many dropouts. Here is a section of the dataset:

> head(mydata,33)
       id obstime agebase      cd4      rna  hem
1   10056       1      59 25.17936 3.611298 15.0
3   10056       3      59 21.33073 4.044030 15.4
4   10082       1      35 23.64318 5.275298 14.9
12  10082       9      35 22.31591 5.493349 14.4
22  10082      19      35       NA 5.875061 13.8
26  10082      23      35 18.84144 5.462503 13.9
28  10082      25      35 23.36664 2.397940 13.7
31  10082      28      35 26.55184       NA 15.3
34  10082      31      35 24.91987       NA 14.8
37  10082      34      35 24.08319       NA 15.5
41  10082      38      35 24.49490       NA 15.2
44  10082      41      35 26.00000       NA 15.5
48  10082      45      35 26.79552       NA 15.6
51  10082      48      35 24.53569       NA 14.9
55  10082      52      35 27.25803       NA 16.2
58  10082      55      35 26.47640       NA 15.4
61  10082      58      35 30.31501       NA 15.6
64  10082      61      35 27.01851       NA 15.8
67  10082      64      35 27.00000       NA   NA
70  10082      67      35 28.37252       NA 16.2
73  10082      70      35 27.20294       NA 14.9
77  10082      74      35 25.23886       NA 14.7
79  10082      76      35 28.65310       NA 15.8
82  10082      79      35 28.17801       NA   NA
85  10082      82      35 29.52965       NA 15.5
88  10082      85      35 29.52965 2.397940 15.5
89  10143       1      46 20.97618 4.361728 13.2
94  10143       6      46 22.00000 4.173507 14.0
98  10143      10      46 22.00000 4.173507 14.0
99  10215       1      33 20.49390 4.144605 16.0
......
> dim(mydata)
[1] 19793     6
> length(unique(mydata$id))
[1] 2161

I need to produce bootstrap samples from this dataset in which the individual clusters are preserved: if an individual is sampled, the entire set of observations for that id enters the bootstrap sample. An individual may of course be sampled more than once, in which case it should enter the resampled data the appropriate number of times and ideally receive an altered ID number, for example 10056.1, 10056.2.

For now I am going to brute-force the problem as well as I can, but if anyone has ideas on how to do this fast, I would much appreciate it.

EDIT: What I ended up using:

dat <- mydata
indiv <- unique(dat[, 1])
smp <- sort(sample(indiv, length(indiv), replace=TRUE))
smp.df <- data.frame(id=smp)
dat.b = merge(smp.df, dat, all.x=TRUE)    
# Number of observations for all IDs in original dataset
n.obs <- table(dat[, 1])
# Unique ids in the bootstrap sample
smpU <- unique(smp)
# Number of replicates sampled
reps <- as.vector(table(smp))
# Number of observations in the sampled IDs observation sets
obs <- as.vector(n.obs[match(smpU, names(n.obs))])

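# Build the new IDs: append a replicate suffix (.1, .2, ...) to each sampled ID
id.rep.obs <- cbind(smpU, reps, obs)
NameFun <- function(info) {
  # info = c(id, number of replicates, observations per replicate)
  # Note: as.numeric() maps e.g. "10056.10" and "10056.1" to the same value,
  # so these IDs are only unique while an ID is drawn fewer than 10 times
  names <- as.numeric(paste0(rep(info[1], info[2]), ".", seq(1, info[2])))
  sort(rep(names, info[3]))
}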
dat.b[, 1] <- do.call("c", apply(id.rep.obs, 1, NameFun))
dat.b <- dat.b[order(dat.b[, 1], dat.b[, 2]), ]

Solution

You can use sample() to draw a vector of ids with replacement, then merge() it with the data.

First, recreate the data:

dat <- read.table(text="
       id obstime agebase      cd4      rna  hem
1   10056       1      59 25.17936 3.611298 15.0
3   10056       3      59 21.33073 4.044030 15.4
4   10082       1      35 23.64318 5.275298 14.9
12  10082       9      35 22.31591 5.493349 14.4
22  10082      19      35       NA 5.875061 13.8
26  10082      23      35 18.84144 5.462503 13.9
28  10082      25      35 23.36664 2.397940 13.7
31  10082      28      35 26.55184       NA 15.3
34  10082      31      35 24.91987       NA 14.8
37  10082      34      35 24.08319       NA 15.5
41  10082      38      35 24.49490       NA 15.2
44  10082      41      35 26.00000       NA 15.5
48  10082      45      35 26.79552       NA 15.6
51  10082      48      35 24.53569       NA 14.9
55  10082      52      35 27.25803       NA 16.2
58  10082      55      35 26.47640       NA 15.4
61  10082      58      35 30.31501       NA 15.6
64  10082      61      35 27.01851       NA 15.8
67  10082      64      35 27.00000       NA   NA
70  10082      67      35 28.37252       NA 16.2
73  10082      70      35 27.20294       NA 14.9
77  10082      74      35 25.23886       NA 14.7
79  10082      76      35 28.65310       NA 15.8
82  10082      79      35 28.17801       NA   NA
85  10082      82      35 29.52965       NA 15.5
88  10082      85      35 29.52965 2.397940 15.5
89  10143       1      46 20.97618 4.361728 13.2
94  10143       6      46 22.00000 4.173507 14.0
98  10143      10      46 22.00000 4.173507 14.0
99  10215       1      33 20.49390 4.144605 16.0", header=TRUE)

Now create a sample of id numbers:

set.seed(42)
indiv <- unique(dat$id)
smp <- data.frame(id=sample(indiv, 10, replace=TRUE))
smp

      id
1  10082
2  10143
3  10215
4  10082
5  10082
6  10215
7  10215
8  10056
9  10082
10 10143

Finally, merge:

merge(smp, dat, all.x=TRUE)

The result contains the full set of observations for each sampled id, repeated once for every time that id was drawn:

       id obstime agebase      cd4      rna  hem
1   10056       1      59 25.17936 3.611298 15.0
2   10056       3      59 21.33073 4.044030 15.4
3   10082      19      35       NA 5.875061 13.8
4   10082      23      35 18.84144 5.462503 13.9
5   10082       1      35 23.64318 5.275298 14.9
6   10082       9      35 22.31591 5.493349 14.4
7   10082      31      35 24.91987       NA 14.8
8   10082      34      35 24.08319       NA 15.5
9   10082      25      35 23.36664 2.397940 13.7
10  10082      28      35 26.55184       NA 15.3
11  10082      45      35 26.79552       NA 15.6
12  10082      48      35 24.53569       NA 14.9
13  10082      38      35 24.49490       NA 15.2
14  10082      41      35 26.00000       NA 15.5
15  10082      58      35 30.31501       NA 15.6
16  10082      61      35 27.01851       NA 15.8
17  10082      52      35 27.25803       NA 16.2
18  10082      55      35 26.47640       NA 15.4
19  10082      70      35 27.20294       NA 14.9
20  10082      74      35 25.23886       NA 14.7
21  10082      64      35 27.00000       NA   NA
22  10082      67      35 28.37252       NA 16.2
23  10082      82      35 29.52965       NA 15.5
24  10082      85      35 29.52965 2.397940 15.5
25  10082      76      35 28.65310       NA 15.8
26  10082      79      35 28.17801       NA   NA
27  10082      19      35       NA 5.875061 13.8
28  10082      23      35 18.84144 5.462503 13.9
29  10082       1      35 23.64318 5.275298 14.9
30  10082       9      35 22.31591 5.493349 14.4
31  10082      31      35 24.91987       NA 14.8
32  10082      34      35 24.08319       NA 15.5
33  10082      25      35 23.36664 2.397940 13.7
34  10082      28      35 26.55184       NA 15.3
35  10082      45      35 26.79552       NA 15.6
36  10082      48      35 24.53569       NA 14.9
37  10082      38      35 24.49490       NA 15.2
38  10082      41      35 26.00000       NA 15.5
39  10082      58      35 30.31501       NA 15.6
40  10082      61      35 27.01851       NA 15.8
41  10082      52      35 27.25803       NA 16.2
42  10082      55      35 26.47640       NA 15.4
43  10082      70      35 27.20294       NA 14.9
44  10082      74      35 25.23886       NA 14.7
45  10082      64      35 27.00000       NA   NA
46  10082      67      35 28.37252       NA 16.2
47  10082      82      35 29.52965       NA 15.5
48  10082      85      35 29.52965 2.397940 15.5
49  10082      76      35 28.65310       NA 15.8
50  10082      79      35 28.17801       NA   NA
51  10082      19      35       NA 5.875061 13.8
52  10082      23      35 18.84144 5.462503 13.9
53  10082       1      35 23.64318 5.275298 14.9
54  10082       9      35 22.31591 5.493349 14.4
55  10082      31      35 24.91987       NA 14.8
56  10082      34      35 24.08319       NA 15.5
57  10082      25      35 23.36664 2.397940 13.7
58  10082      28      35 26.55184       NA 15.3
59  10082      45      35 26.79552       NA 15.6
60  10082      48      35 24.53569       NA 14.9
61  10082      38      35 24.49490       NA 15.2
62  10082      41      35 26.00000       NA 15.5
63  10082      58      35 30.31501       NA 15.6
64  10082      61      35 27.01851       NA 15.8
65  10082      52      35 27.25803       NA 16.2
66  10082      55      35 26.47640       NA 15.4
67  10082      70      35 27.20294       NA 14.9
68  10082      74      35 25.23886       NA 14.7
69  10082      64      35 27.00000       NA   NA
70  10082      67      35 28.37252       NA 16.2
71  10082      82      35 29.52965       NA 15.5
72  10082      85      35 29.52965 2.397940 15.5
73  10082      76      35 28.65310       NA 15.8
74  10082      79      35 28.17801       NA   NA
75  10082      19      35       NA 5.875061 13.8
76  10082      23      35 18.84144 5.462503 13.9
77  10082       1      35 23.64318 5.275298 14.9
78  10082       9      35 22.31591 5.493349 14.4
79  10082      31      35 24.91987       NA 14.8
80  10082      34      35 24.08319       NA 15.5
81  10082      25      35 23.36664 2.397940 13.7
82  10082      28      35 26.55184       NA 15.3
83  10082      45      35 26.79552       NA 15.6
84  10082      48      35 24.53569       NA 14.9
85  10082      38      35 24.49490       NA 15.2
86  10082      41      35 26.00000       NA 15.5
87  10082      58      35 30.31501       NA 15.6
88  10082      61      35 27.01851       NA 15.8
89  10082      52      35 27.25803       NA 16.2
90  10082      55      35 26.47640       NA 15.4
91  10082      70      35 27.20294       NA 14.9
92  10082      74      35 25.23886       NA 14.7
93  10082      64      35 27.00000       NA   NA
94  10082      67      35 28.37252       NA 16.2
95  10082      82      35 29.52965       NA 15.5
96  10082      85      35 29.52965 2.397940 15.5
97  10082      76      35 28.65310       NA 15.8
98  10082      79      35 28.17801       NA   NA
99  10143      10      46 22.00000 4.173507 14.0
100 10143       1      46 20.97618 4.361728 13.2
101 10143       6      46 22.00000 4.173507 14.0
102 10143      10      46 22.00000 4.173507 14.0
103 10143       1      46 20.97618 4.361728 13.2
104 10143       6      46 22.00000 4.173507 14.0
105 10215       1      33 20.49390 4.144605 16.0
106 10215       1      33 20.49390 4.144605 16.0
107 10215       1      33 20.49390 4.144605 16.0
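
The merged result above does not yet distinguish repeated draws of the same id. One possible way to get the altered ids asked for in the question (10082.1, 10082.2, ...) is to number the draws before merging. This is only a sketch on a small made-up data frame; the column names `draw` and `newid` are my own additions, not part of the original answer.

```r
# Toy data standing in for the real dataset (values are illustrative only)
dat <- data.frame(
  id      = c(1, 1, 2, 2, 2, 3),
  obstime = c(1, 3, 1, 9, 19, 1),
  cd4     = c(25.2, 21.3, 23.6, 22.3, NA, 20.5)
)

set.seed(1)
indiv <- unique(dat$id)
smp <- data.frame(id = sample(indiv, length(indiv), replace = TRUE))

# Number the draws within each id: 1 for its first draw, 2 for its second, ...
smp$draw <- ave(seq_along(smp$id), smp$id, FUN = seq_along)

# Each (id, draw) row of smp pulls in the full observation set for that id
boot <- merge(smp, dat, by = "id")

# Character ids such as "2.1", "2.2"; kept as text so "2.10" stays distinct from "2.1"
boot$newid <- paste(boot$id, boot$draw, sep = ".")
boot <- boot[order(boot$id, boot$draw, boot$obstime), ]
```

Keeping `newid` as a character column sidesteps the problem that a numeric 10082.10 would be indistinguishable from 10082.1.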

Other answers

I know this is an old question, but I have a solution of my own.

The basic idea is to split the data into a list by id, then sample the ids with replacement. Each sampled id is used to look up its data frame in the list, and each draw gets a new id:

out <- split(dat, f = dat$id)

# Sample from the unique ids so every individual is equally likely to be drawn
smp.id <- sample(unique(dat$id), length(unique(dat$id)), replace = TRUE)

samp.df <- lapply(seq_along(smp.id), function(x){
  res <- out[[as.character(smp.id[x])]] # index by name, not by numeric position
  res$newID <- x # one new ID per draw
  return(res)
})

samp.df <- do.call(rbind, samp.df)

The variable newID distinguishes repeated draws of the same individual in the bootstrap sample.
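
To use this for actual bootstrap inference, the resampling step can be wrapped in a function and repeated. The sketch below does that on a small made-up data frame; the function name `cluster_boot`, the statistic (mean of `cd4`), and the number of replicates are illustrative assumptions rather than anything from the question.

```r
# Toy data standing in for the real dataset (values are illustrative only)
dat <- data.frame(
  id      = c(1, 1, 2, 2, 2, 3, 4, 4),
  obstime = c(1, 3, 1, 9, 19, 1, 1, 6),
  cd4     = c(25.2, 21.3, 23.6, 22.3, NA, 20.5, 21.0, 22.0)
)

# Draw one cluster bootstrap sample: resample ids, keep each id's rows together
cluster_boot <- function(data, id_col = "id") {
  groups <- split(data, data[[id_col]])
  ids <- sample(names(groups), length(groups), replace = TRUE)
  pieces <- lapply(seq_along(ids), function(i) {
    res <- groups[[ids[i]]]
    res$newID <- i  # one label per draw, so repeated ids stay distinguishable
    res
  })
  do.call(rbind, pieces)
}

set.seed(123)
B <- 200  # number of bootstrap replicates (arbitrary choice for the example)
boot_means <- replicate(B, mean(cluster_boot(dat)$cd4, na.rm = TRUE))

# Percentile interval for the mean cd4
quantile(boot_means, c(0.025, 0.975))
```

Splitting once per call is fine for a sketch; with many replicates on a large dataset it may be worth splitting the data once outside the function and reusing the list.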

Licensed under: CC-BY-SA with attribution