Split data into training and validation datasets with a non-representative class

https://stackoverflow.com/questions/23594940

20-07-2023

Question

I've got a data set with 130000 records and 15 variables.

The variable that I want to model is IsActive. The problem is that only 15000 records have this variable set to 1; the rest are set to 0.

First I want to split source data into two datasets:

20% ~30k records -> training data set

80% ~120k records -> validation data set.

I want 5k records with IsActive = 1 in the training data set and 10k records with IsActive = 1 in the validation data set, and I want these numbers to be easy to adjust.

How can I do this?

What I have done so far:

set.seed(2)
ind <- sample(2, nrow(mydata), replace = TRUE, prob=c(0.8, 0.2))

And when I want to get the 80% subset of mydata:

newdata <- mydata[ind == 1, ]

Solution

Your question still does not make sense: 20% of 130,000 is not 30,000. The simplest assumption that fixes all of your logical inconsistencies is that the dataset has 150,000 records, so I used that.
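The arithmetic behind that assumption can be checked in a couple of lines. This is only a quick sketch; the variable names `n_stated` and `n_assumed` are made up for illustration.

```r
n_stated  <- 130000            # total stated in the question
n_assumed <- 150000            # total assumed in this answer

n_stated / 5                   # 20% of the stated total: 26000, not 30000
n_assumed / 5                  # 20% of the assumed total: 30000
n_assumed - n_assumed / 5      # remaining 80%: 120000
```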

Here is one way to do it:

# sample data
set.seed(1)                  # for a reproducible example
df <- data.frame(id = 1:150000,
                 IsActive = sample(0:1, 150000, replace = TRUE, prob = c(0.9, 0.1)),
                 x = rnorm(150000), y = runif(150000), z = rpois(150000, lambda = 1))
sum(df$IsActive == 1)        # number of active records
# [1] 14887

s1 <- sample(which(df$IsActive == 1), 5000)    # 5,000 active rows for training
s2 <- sample(which(df$IsActive == 0), 25000)   # 25,000 inactive rows for training
train <- df[c(s1, s2), ]
test  <- df[-c(s1, s2), ]                      # all remaining rows
# validate
any(test$id %in% train$id)   # train and test are disjoint
# [1] FALSE
sum(train$IsActive == 1)     # 5000
# [1] 5000
sum(test$IsActive == 1)      # the rest
# [1] 9887
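
Since the question asks for the counts to be easy to adjust, the same idea could be wrapped in a small function. The sketch below is not part of the approach above as written; `split_by_class` and its arguments are hypothetical names, and it assumes the class labels can be matched after converting them to character.

```r
# Hypothetical helper: draw a fixed number of rows per class for training;
# every row not drawn goes into the test set.
split_by_class <- function(data, target, n_train) {
  # n_train is a named vector of per-class counts, e.g. c("1" = 5000, "0" = 25000)
  train_idx <- unlist(lapply(names(n_train), function(cls) {
    rows <- which(as.character(data[[target]]) == cls)
    stopifnot(n_train[[cls]] <= length(rows))   # cannot draw more rows than exist
    # index via sample.int so a class with a single row is handled correctly
    rows[sample.int(length(rows), n_train[[cls]])]
  }))
  test_idx <- setdiff(seq_len(nrow(data)), train_idx)
  list(train = data[train_idx, , drop = FALSE],
       test  = data[test_idx, , drop = FALSE])
}

# Small demo on made-up data: 100 active and 900 inactive rows
set.seed(1)
demo  <- data.frame(id = 1:1000, IsActive = rep(c(1, 0), times = c(100, 900)))
parts <- split_by_class(demo, "IsActive", c("1" = 30, "0" = 170))
table(parts$train$IsActive)   # 170 rows with 0, 30 rows with 1
```

With the data frame from the answer, the equivalent call would be `split_by_class(df, "IsActive", c("1" = 5000, "0" = 25000))`; changing the counts only means changing that named vector.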

Licensed under: CC-BY-SA with attribution
