Question

I have an R script that reads parameters via commandArgs() to decide which slices of a dataset to take, before saving those slices to a text file that a C++ program then interprets. The dataset is a survey conducted in the EU, and I would like to slice by respondent's country by comparing the relevant arguments in the commandArgs vector against a string vector countries that contains all possible options. Using that and a corresponding integer matrix countryIndices, which contains the bounds of each country (e.g. all Belgian respondents are in rows 1-1043, so countryIndices[1,1]=1 and countryIndices[2,1]=1043), I wish to use which() to construct a matrix personIndices that holds all the relevant bounds.

From this I want to make a vector that contains a sample of indices from the requested countries. The size of this vector is either sampleSize*sampleCountries (sampling sampleSize people per country) or simply sampleSize, depending on another parameter passed through. I was hoping that, at least for the latter type of sampling, I could make this vector in one go with the c() function, as follows:

personIndices<-rbind(c(1,1043),c(2044,3061),c(8423,8922))
sampleVector<-c(personIndices[,1]:personIndices[,2])

And then sampling from that vector.

I'd hoped that this would make a vector containing the numbers 1:1043, 2044:3061 and 8423:8922, but this sadly does not seem to work. Any tips? Out of desperation I've constructed a monstrosity containing ifs in ifs in ifs, and I'd rather not have it see the light of day if there's a smarter approach, but I haven't been able to find one. For reference as to what I'm doing (or if I wasn't being clear enough), said monstrosity can be found at http://pastebin.ca/2650188. Thanks in advance!


Solution

All the acrobatics with vectors of indices are unnecessary.

Logical indexing and subsetting are really all you need, using a new 'country' field (a factor) that you add to your data. (Maybe also plyr::ddply if you get real fancy.)

All you want to do is allow the user to:

  1. Choose a country from a list (by selecting its number, 2-letter abbrev, whatever)...
  2. ... then sample in your dataset from within that country. That's all!


dat$country <- NA  # insert a new column, initialize to NA for pessimism, to catch omissions
dat$country[1:1043]    <- 'Belgium'   # dat$country is a vector, so it takes a single index
dat$country[2044:3061] <- 'Bulgaria'
dat$country[8423:8922] <- 'Czech Rep'
...
# Now make country a factor instead of character
dat$country <- as.factor(dat$country)

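# sample() on a data frame would pick columns, so sample row numbers instead.
# Either select the rows by logical indexing (which() drops the NA rows)...
bulgaria <- dat[which(dat$country=='Bulgaria'),]
bulgaria[sample(nrow(bulgaria), ...),]
# ...or by subsetting
bulgaria <- subset(dat, country=='Bulgaria')
bulgaria[sample(nrow(bulgaria), ...),]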
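
As a minimal end-to-end sketch of this approach, the following runs on a toy data frame. The `answer` column, the total of 8922 rows, and the sample size of 5 are assumptions made purely for illustration; the country labels are the ones used above.

```r
set.seed(1)  # only so the sketch is reproducible

# Hypothetical stand-in for the survey data: 8922 rows, one dummy column
dat <- data.frame(answer = rnorm(8922))

# Label the rows by country; rows not covered stay NA
dat$country <- NA_character_
dat$country[1:1043]    <- "Belgium"
dat$country[2044:3061] <- "Bulgaria"
dat$country[8423:8922] <- "Czech Rep"
dat$country <- as.factor(dat$country)

# which() drops the NA comparisons, so only genuine Bulgaria rows are kept
rows <- which(dat$country == "Bulgaria")

# Draw 5 row numbers without replacement, then pull those rows
picked <- sample(rows, 5)
bulgariaSample <- dat[picked, ]
```

Sampling row numbers and indexing afterwards keeps the original row positions available in `picked`, which is what the C++ side in the question appears to need.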

Other tips

I would summarize your code as:

  1. If sampleType is TRUE, then draw a sample of size sampleSize from the indices corresponding to each country in sampleCountries, and return all these sampled indices together.
  2. If sampleType is FALSE, then group the indices corresponding to all the countries in sampleCountries together and draw a single sample of size sampleSize.

Let's set up some sample parameters:

sampleCountries <- c("BE", "WG")
sampleSize <- 20
sampleType <- FALSE

The first step is to build a vector of the country for each index:

countries = c(rep("BE", 1043), rep("DM", 1000), rep("WG", 1018), rep("GR", 1003),
              rep("IT", 1021), rep("SP", 1021), rep("FR", 1008), rep("IR", 1000),
              rep("NI", 308), rep("LX", 500), rep("NL", 1022), rep("PT", 1000),
              rep("GB", 1066), rep("EG", 1014))
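
The same vector can be written more compactly, and the bounds matrix described in the question can be derived from the counts instead of being typed in by hand. This is only a sketch: `codes`, `counts`, `first`, and `last` are names introduced here, with the values copied from the rep() calls above.

```r
# Country codes and respondent counts, copied from the rep() calls above
codes  <- c("BE", "DM", "WG", "GR", "IT", "SP", "FR", "IR",
            "NI", "LX", "NL", "PT", "GB", "EG")
counts <- c(1043, 1000, 1018, 1003, 1021, 1021, 1008, 1000,
            308, 500, 1022, 1000, 1066, 1014)

# Same vector as above: each code repeated by its count
countries <- rep(codes, times = counts)

# Bounds per country: row 1 = first index, row 2 = last index
last  <- cumsum(counts)
first <- last - counts + 1
countryIndices <- rbind(first, last)
colnames(countryIndices) <- codes

countryIndices[, "WG"]  # first 2044, last 3061
```

The derived bounds agree with the ones quoted in the question (1-1043, 2044-3061, 8423-8922), which is a quick sanity check on the counts.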

Next, when "ALL" is in sampleCountries, the code should behave as if all the countries were selected:

if ("ALL" %in% sampleCountries) {
  sampleCountries <- unique(countries)
}

Finally, draw your samples:

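if (sampleType) {
  # one sample of size sampleSize per requested country
  personIndices <- unlist(lapply(sampleCountries, function(x) {
    sample(which(countries == x), sampleSize, replace=FALSE)
  }))
} else {
  # one sample of size sampleSize from all requested countries pooled
  personIndices <- sample(which(countries %in% sampleCountries), sampleSize,
                          replace=FALSE)
}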

In the first part of the if statement, which(countries == x) gets the indices of country x, and lapply does this for all the countries in your vector sampleCountries. Finally, unlist converts the output of lapply to a vector.

In the second part of the if statement, which(countries %in% sampleCountries) gets the indices of every country in sampleCountries.
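
For completeness, the vector the question was originally after can also be built directly from the bounds matrix. The attempt in the question falls short because `:` takes only the first element of each operand, so it produces just 1:1043. One possible way around that, offered as a sketch rather than as what either answer intended, is to build one range per row and flatten the result; `ranges` and `picked` are names introduced here, and the sample size of 20 is an assumed value.

```r
personIndices <- rbind(c(1, 1043), c(2044, 3061), c(8423, 8922))

# Build one integer range per row, then flatten them into a single vector
ranges <- Map(seq, personIndices[, 1], personIndices[, 2])
sampleVector <- unlist(ranges)

# Sample positions rather than values, so a length-one sampleVector
# is not silently treated as 1:n by sample()
sampleSize <- 20  # hypothetical value
picked <- sampleVector[sample(length(sampleVector), sampleSize)]
```

Sampling positions and then indexing avoids the documented sample() behaviour where a single number n is interpreted as the range 1:n.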

Licensed under: CC-BY-SA with attribution