How to create quantiles in R and plot histogram

https://stackoverflow.com/questions/23183003

06-07-2023

Question

I have recently started working with R. I have a dataset which is composed of two columns and 100000 rows as shown below:

       Y    TOTA
1      1    403500.000
2      1    188334.000
3      0    812387.000
4      0    163626.000
5      1    49527.000
6      1    48661.000
7      0    36712.000
8      1    31745.000
9      1    23342.000
10     0    46835.000
...... .    .........
100000 0    10.982

The variable Y can have just two values: 0 or 1, whereas the variable TOTA can have various values. The function summary gives me the following result:

          Y               TOTA         
  Min.   :0.0000   Min.   :       0  
  1st Qu.:0.0000   1st Qu.:     939  
  Median :1.0000   Median :    3918  
  Mean   :0.5113   Mean   :   40245  
  3rd Qu.:1.0000   3rd Qu.:   11028  
  Max.   :1.0000   Max.   :18938000  
                   NA's   :261

AIM:

I would like to create a table with 10 rows and 3 columns. Each row represents a range of values of the variable TOTA (a decile of my dataset), and the last row holds the NAs. I would then like to populate the table from the dataset: if Y is 1, add 1 to the column "Number Active Companies" in the row whose range contains the TOTA value; if Y is 0, add 1 to the column "Number Passive Companies" in that same row.

WHAT I HAVE ATTEMPTED

What I have tried so far is to create a table which will contain the result of my dataset processing

    Number Active Companies  Number Passive Companies   Total
1   0                       0                           0
2   0                       0                           0
3   0                       0                           0
4   0                       0                           0
5   0                       0                           0
6   0                       0                           0
7   0                       0                           0
8   0                       0                           0
9   0                       0                           0
10  0                       0                           0



result<-matrix(data = 0, nrow = 10, ncol = 3, byrow = FALSE, dimnames = list(1:10,c("Number Active Companies","Number Passive Companies","Total")));

Afterwards I created 10 groups, each covering a different range of my variable:

x > 0 && x < 100
x > 100 && x < 1000
x > 1000 && x < 10000
x > 10000 && x < 100000
x > 100000 && x < 1000000
x > 1000000 && x < 1000000
x > 5938000 && x < 10938000
x > 10938000 && x < 15938000
x > 15938000 && x < 18938000
x=NA

Now I would like to populate the previous table as follows: for each row of the dataset, if Y is 1, add 1 to the column "Number Active Companies" in the row whose range contains the TOTA value, and do the same for the column "Number Passive Companies" when Y is 0.

    for(i in TOTA){
    if (Y=1)
          if(x > 0 && x < 100){
          }else if(x > 100 && x < 1000){
          }else if(x > 1000 && x < 10000){
          }else if(x > 10000 && x < 100000){
          }else if(x > 100000 && x < 1000000){
          }else if( x > 1000000 && x < 1000000){
          }else if( x > 1000000 && x < 1000000){
          }else if( x > 5938000 && x < 10938000){
          }else if( x > 10938000 && x < 15938000){      
          }else if( x > 15938000 && x < 18938000) {
          }else{
           //Nas
          } 
    }else if(Y=0){

          if(x > 0 && x < 100){
          }else if(x > 100 && x < 1000){
          }else if(x > 1000 && x < 10000){
          }else if(x > 10000 && x < 100000){
          }else if(x > 100000 && x < 1000000){
          }else if( x > 1000000 && x < 1000000){
          }else if( x > 1000000 && x < 1000000){
          }else if( x > 5938000 && x < 10938000){
          }else if( x > 10938000 && x < 15938000){      
          }else if( x > 15938000 && x < 18938000) {
          }else{
           //Nas
          } 
    }

QUESTIONS

How can I write into the table? How can I do this in an easier manner? How can I create a histogram of this table?

I am wondering whether I am doing the right thing: I have read the manual for the functions quantile() and percentile(), and they seem to do the same thing.

Can you please give me some guidelines, and possibly some commands, to achieve my aim?

Thank you

Solution

Still difficult to figure out what you are trying to accomplish, but this is my best guess:

# create reproducible example - you already have this...
set.seed(1)
df <- data.frame(Y=sample(0:1,100000,replace=TRUE),
                 TOTA=runif(100000,0,18938000))
na     <- sample(1:100000,5000)    # 5% NA
df$TOTA[na] <- NA

# you start here...
breaks <- c(0,10^(2:6), 5938000, 10938000, 15938000, 18938000)
# one label per interval between consecutive breaks, plus one for missing TOTA
labels <- c("0-100","100-1000","1000-10000","10000-100000",
            "100000-1000000","1000000-5938000","5938000-10938000",
            "10938000-15938000","15938000-18938000","NA")
df$group <- cut(df$TOTA,breaks=breaks,labels=FALSE)   # integer group codes 1..9
df$group[is.na(df$group)] <- 10                       # rows with missing TOTA
df$grpLabel <- labels[df$group]

result           <- aggregate(Y~group,df,function(x)sum(x==1))
colnames(result) <- c("Group","Active")
result$Passive   <- aggregate(Y~group,df,function(x)sum(x==0))$Y
result$Group     <- labels[result$Group]
result
#                Group Active Passive
# 1              0-100      0       1
# 2           100-1000      1       2
# 3         1000-10000     29      17
# 4       10000-100000    224     212
# 5     100000-1000000   2310    2288
# 6    1000000-5938000  12365   12328
# 7   5938000-10938000  12508   12522
# 8  10938000-15938000  12526   12649
# 9  15938000-18938000   7485    7533
# 10                NA   2544    2456

So this divides the dataset into groups using cut(...), then sums the 1s and 0s separately using aggregate(...), then labels the groups. Normally you could use cut(...) without labels=F and get meaningful labels for your groups directly. The problem here is that aggregate(...) will sort these alphabetically, which is not what you want.
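As a variation on the approach above, here is a minimal sketch that keeps the factor labels produced by cut() and counts with table() instead of aggregate(). table() lists the groups in factor-level order (the order of the breaks), and useNA = "ifany" adds a row for rows with missing TOTA. The small random data set and the names n, bin, counts and groups are assumptions made for illustration; include.lowest = TRUE is also an addition, so that a TOTA of exactly 0 lands in the first bin.

```r
# Small made-up data set standing in for the real one (an assumption).
set.seed(42)
n  <- 1000
df <- data.frame(Y    = sample(0:1, n, replace = TRUE),
                 TOTA = runif(n, 0, 18938000))
df$TOTA[sample(n, 50)] <- NA               # inject 50 missing values

breaks <- c(0, 10^(2:6), 5938000, 10938000, 15938000, 18938000)

# cut() returns a factor whose levels follow the order of the breaks.
# dig.lab = 8 keeps the interval bounds out of scientific notation.
df$bin <- cut(df$TOTA, breaks = breaks, include.lowest = TRUE, dig.lab = 8)

# Cross-tabulate bins against Y; empty bins are kept with a count of 0.
counts <- table(bin = df$bin, Y = df$Y, useNA = "ifany")

# Reshape into the three-column layout asked for in the question.
groups <- rownames(counts)
groups[is.na(groups)] <- "NA"              # the missing-value row has no name
result <- data.frame(Group   = groups,
                     Active  = as.vector(counts[, "1"]),
                     Passive = as.vector(counts[, "0"]))
result$Total <- result$Active + result$Passive
result
```

The result has one row per interval plus a final row for the NAs, so it matches the 10-row table described in the question.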

Also, note that in your question you have a range 1000000 - 1000000 (i.e. 1MM to 1MM). I assumed this was supposed to be 1000000 - 5938000.
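The question also mentions deciles and a plot, which the fixed breaks above do not cover. The following is a rough sketch, under assumptions, of one way to get true deciles with quantile() and to draw the counts as a grouped bar chart with barplot(). The log-normal sample data and the names decile_breaks, decile and counts are made up for illustration, and a bar chart is only one reading of "histogram of this table".

```r
# Made-up skewed data standing in for the real TOTA column (an assumption).
set.seed(42)
n  <- 1000
df <- data.frame(Y    = sample(0:1, n, replace = TRUE),
                 TOTA = rlnorm(n, meanlog = 8, sdlog = 2))
df$TOTA[sample(n, 50)] <- NA               # inject 50 missing values

# Decile boundaries: the 0%, 10%, ..., 100% quantiles, ignoring NAs.
decile_breaks <- quantile(df$TOTA, probs = seq(0, 1, by = 0.1), na.rm = TRUE)

# Assign each row a decile number 1..10; rows with missing TOTA stay NA.
# Note: cut() needs distinct breaks, so heavily tied data may need extra care.
df$decile <- cut(df$TOTA, breaks = decile_breaks,
                 labels = FALSE, include.lowest = TRUE)

# Counts of passive (Y = 0) and active (Y = 1) rows per decile, plus an NA row.
counts <- table(decile = df$decile, Y = df$Y, useNA = "ifany")

# Grouped bars: one pair of bars per decile, with the NA group last.
barplot(t(counts), beside = TRUE,
        names.arg   = c(1:10, "NA"),
        legend.text = c("Passive (Y = 0)", "Active (Y = 1)"),
        xlab = "Decile of TOTA", ylab = "Number of companies")
```

Because decile boundaries come from the data, each decile holds roughly the same number of rows, unlike the hand-picked ranges, where most rows fall into a few groups.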

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow