Merging and reshaping data/table with multiple tables of different sizes

https://stackoverflow.com/questions/22378632

14-06-2023

Question

My goal is to build a table that, for a list of categorical variables, returns (from the leftmost column to the rightmost): the categorical variable name, the variable's level, the frequency for the first level of a binary grouping variable, the frequency for the second level of that grouping variable, the chi-squared test statistic, the p-value, and the testing method. An example of the output I want is shown further down. The current output and code are for a single categorical variable. I'm trying not to put the cart before the horse: for now, getting the right format for a single variable is enough. After that I'll work on running it over a set of variables and rbind-ing the results together.

The code shows what I could figure out so far. I'm fairly certain there is an easier way to do this. I've been told about tables::tabular, but couldn't get it to do exactly what I wanted. I currently can't figure out the reshape (nor how to get rid of the duplicates in the final three columns once that works, but I'm not there yet).

Any help using the current code, or a different method, would be very much appreciated.

# make data (I couldn't get return() to work, so I used <<-)
get.data <- function() {
  set.seed(1)
  cat1  <- sample(c(1, 2), 100, replace = TRUE)
  cont1 <- rnorm(100, 25, 8)
  cont2 <- rnorm(100, 0, 1)
  cont3 <- rnorm(100, 6, 14.23)
  cont4 <- rnorm(100, 25, 8) * runif(5, 0.1, 1)
  cat2  <- sample(c(1, 2, 3, 4), 100, replace = TRUE)
  cat3  <- sample(c(1, 2, 3, 4, 5), 100, replace = TRUE)
  cat4  <- sample(c("Caucasian", "African American", "Latino", "Multi-Racial",
                    "No Response"), 100, replace = TRUE)
  group <- sample(c(0, 1), 100, replace = TRUE)
  sex   <- sample(c("male", "female"), 100, replace = TRUE)
  one <<- data.frame(group, sex, cat1, cont1, cont2, cont3, cont4, cat2, cat3, cat4)
}

get.data()

#getting the two bits of data I would like
attach(one)
long <- (with(one, table(cat2,group)))
test<-with(one, chisq.test(cat2,group))
kk<-c(test$statistic,test$p.value,test$method)
detach(one)

#merging them together
res<-merge(as.data.frame(as.matrix(long)), as.data.frame(as.matrix(kk)),
     all=TRUE, sort=FALSE)
#unsuccessfully reshaping the data
wider <- reshape(as.data.frame(res), idvar = cat2,
     timevar = "V1", direction = "wide")

Here is what the output from 'res' looks like:

#    cat2  group  Freq  V1
# 1  1     0      17    1.16345446805217
# 2  2     0      11    1.16345446805217
# 3  3     0      13    1.16345446805217
# 4  4     0      13    1.16345446805217
# 5  1     1      12    1.16345446805217
# 6  2     1      13    1.16345446805217
# 7  3     1      9     1.16345446805217
# 8  4     1      12    1.16345446805217
# 9  1     0      17    0.761782111152171
# 10 2     0      11    0.761782111152171
# 11 3     0      13    0.761782111152171
# 12 4     0      13    0.761782111152171
# 13 1     1      12    0.761782111152171
# 14 2     1      13    0.761782111152171
# 15 3     1      9     0.761782111152171
# 16 4     1      12    0.761782111152171
# 17 1     0      17    Pearson's Chi-squared test
# 18 2     0      11    Pearson's Chi-squared test
# 19 3     0      13    Pearson's Chi-squared test
# 20 4     0      13    Pearson's Chi-squared test
# 21 1     1      12    Pearson's Chi-squared test
# 22 2     1      13    Pearson's Chi-squared test
# 23 3     1      9     Pearson's Chi-squared test
# 24 4     1      12    Pearson's Chi-squared test

HERE IS WHAT I WANT THE OUTPUT TO LOOK LIKE:

Variable     Response    Group1.Freq    Group2.Freq    Test.Stat    p.value     method
Cat2         1           17             12             1.16         0.761       Pearson's Chi...
             2           11             13
             3           13             9
             4           13             12

NEW ISSUE: I used Ram's suggestion to make a function so that I could build a data.frame for multiple categorical variables, and came up with the code below. But the output gets messed up during the rbind and lapply steps, and I'm wondering how to fix that. Again, the output is at the bottom.

get.data<-function(){
  set.seed(1)
  cat1 <-sample(c(1,2), 100, replace=T)
  cont1<-rnorm(100, 25, 8)
  cont2<-rnorm(100, 0, 1)
  cont3<-rnorm(100, 6, 14.23)
  cont4<-rnorm(100, 25, 8)*runif(5, 0.1, 1)
  cat2<-sample(c(1,2,3,4),100,replace=TRUE)
  cat3<-sample(c(1,2,3,4,5),100,replace=TRUE)
  cat4<-sample(c("Caucasian","African American", "Latino", "Multi-Racial", "No Response"),100,replace=TRUE)
  group<-sample(c(0,1), 100, replace=T)
  sex<-sample(c("male", "female"), 100, replace=T)
  one  <<-data.frame(group, sex,cat1, cont1, cont2, cont3, cont4,cat2,cat3,cat4)
}

get.data()

make.table<-function(catvars,group,data){
  attach(data)
  get.chi.stuff<-function(cat, group){
    long <- table(cat,group)
    test<-chisq.test(cat,group)
    kk<-c(test$statistic,test$p.value,test$method)
    res <- data.frame(matrix(NA,nrow(long),7))
    names(res) <- c("Variable", "Response", "Group1.Freq", "Group2.Freq",
                    "Test.Stat", "p.value", "method")
    res[1,1] <- deparse(substitute(cat))
    res[,2] <- row.names(long)
    res[,3:4] <- long[,1:2]
    res[1,5:7] <- kk

    return(res)
  }
  tables<<-do.call(rbind,lapply(data[,catvars],get.chi.stuff,group=group))

  detach(data)
}
make.table(catvars=catvars,group=group, data=one)

OUTPUT (it isn't formatted the way it should be here; the problem columns are row.names and Variable, and the rest looks fine):

    row.names  Variable  Response          Group1.Freq  Group2.Freq  Test.Stat         p.value             method
1   cat2.1     X[[1L]]   1                 17           12           1.16345446805217  0.761782111152171   Pearson's Chi-squared test
2   cat2.2     NA        2                 11           13           NA                NA                  NA
3   cat2.3     NA        3                 13           9            NA                NA                  NA
4   cat2.4     NA        4                 13           12           NA                NA                  NA
5   cat3.1     X[[2L]]   1                 8            15           5.68288366946583  0.224115426983988   Pearson's Chi-squared test
6   cat3.2     NA        2                 10           7            NA                NA                  NA
7   cat3.3     NA        3                 14           11           NA                NA                  NA
8   cat3.4     NA        4                 8            7            NA                NA                  NA
9   cat3.5     NA        5                 14           6            NA                NA                  NA
10  cat4.1     X[[3L]]   African American  9            18           8.73180996607079  0.0681639164530817  Pearson's Chi-squared test
11  cat4.2     NA        Caucasian         14           5            NA                NA                  NA
12  cat4.3     NA        Latino            6            7            NA                NA                  NA
13  cat4.4     NA        Multi-Racial      14           9            NA                NA                  NA
14  cat4.5     NA        No Response       11           7            NA                NA                  NA
15  sex.1      X[[4L]]   female            30           17           2.74327353028067  0.0976645121155453  Pearson's Chi-squared test with Yates' continuity correction
16  sex.2      NA        male              24           29           NA                NA                  NA
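
The X[[1L]] entries in the Variable column come from deparse(substitute(cat)): inside lapply, the argument handed to the function is lapply's own indexing expression, not the column name. A minimal sketch of the effect, using a made-up helper label_of and a toy data frame df (neither is from the question); the exact text of the deparsed expression may differ between R versions:

```r
# Hypothetical helper: returns the expression the caller passed in, as text
label_of <- function(x) deparse(substitute(x))

df <- data.frame(a = 1:3, b = 4:6)

# Called directly, the label is the expression as typed by the caller
direct <- label_of(df$a)

# Called through lapply, the label is lapply's internal indexing
# expression (something like X[[i]]), never "a" or "b"
via_lapply <- unlist(lapply(df, label_of))

# The column names are still available from the data frame itself
col_names <- names(df)

print(direct)
print(via_lapply)
print(col_names)
```

Looping over the column names (strings) rather than over the columns themselves keeps the name available for labelling.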

Solution

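Because long and kk have no column names in common, merge() pairs every row of one with every row of the other, so each test result is repeated against every frequency row. That is not what you want for your res.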
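
A small sketch of that behavior, using two made-up data frames (freq and stats are illustrative names, not objects from the question). When the inputs share no column names, merge() returns their Cartesian product, which is why 8 frequency rows and 3 test values turned into 24 rows in the question:

```r
# Two data frames with no column names in common
freq  <- data.frame(level = c("a", "b"), n = c(3, 5))
stats <- data.frame(V1 = c("stat", "p", "method"))

# With nothing to join on, every row of freq is paired
# with every row of stats
crossed <- merge(freq, stats, all = TRUE, sort = FALSE)

print(nrow(crossed))  # 2 * 3 rows
print(crossed)
```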

You have already created all the components you need for res in the variables long, kk and test. So now it is a matter of stitching them together in the specific format that you want.

This is not very elegant, because we are constructing the desired result by hand, column by column, but you could wrap all of it in a function.

res <- data.frame(matrix(NA,nrow(long),7))
names(res) <- c("Variable", "Response", "Group1.Freq", "Group2.Freq",
                  "Test.Stat", "p.value", "method")
res[1,1] <- names(attr(test$observed, "dimnames")[1])
res[,2] <- row.names(long)
res[,3:4] <- long[,1:2]
res[1,5:7] <- kk
res
#  Variable Response Group1.Freq Group2.Freq        Test.Stat
# 1     cat2        1          17          12 1.16345446805217
# 2     <NA>        2          11          13             <NA>
# 3     <NA>        3          13           9             <NA>
# 4     <NA>        4          13          12             <NA>
#            p.value                     method
# 1 0.761782111152171 Pearson's Chi-squared test
# 2              <NA>                       <NA>
# 3              <NA>                       <NA>
# 4              <NA>                       <NA>
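
One way the steps above could be wrapped into a function and applied to several variables, offered as a sketch rather than a definitive implementation. The helper name chi_block and the demo data frame are assumptions made for illustration. The function takes column names as strings, so the Variable label does not depend on how the function is called. Unlike the hand-built res above, Test.Stat and p.value stay numeric here instead of being coerced to character by c().

```r
# Hypothetical helper: summary block for one categorical variable.
# var and group are column names (strings) in data.
chi_block <- function(var, group, data) {
  counts <- table(data[[var]], data[[group]])
  test   <- chisq.test(data[[var]], data[[group]])
  n      <- nrow(counts)
  data.frame(
    Variable    = c(var, rep(NA, n - 1)),      # label on the first row only
    Response    = rownames(counts),
    Group1.Freq = as.vector(counts[, 1]),
    Group2.Freq = as.vector(counts[, 2]),
    Test.Stat   = c(unname(test$statistic), rep(NA, n - 1)),
    p.value     = c(test$p.value, rep(NA, n - 1)),
    method      = c(test$method, rep(NA, n - 1)),
    stringsAsFactors = FALSE
  )
}

# Illustrative data, not the data from the question
set.seed(1)
demo <- data.frame(
  group = sample(c(0, 1), 100, replace = TRUE),
  cat2  = sample(1:4, 100, replace = TRUE),
  sex   = sample(c("male", "female"), 100, replace = TRUE)
)

# Iterate over the names, build one block per variable, then stack them
catvars <- c("cat2", "sex")
tables  <- do.call(rbind, lapply(catvars, chi_block, group = "group", data = demo))
rownames(tables) <- NULL
print(tables)
```

Returning the data frame from the function, instead of assigning it with <<-, keeps the result explicit and avoids the need for attach() and detach().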

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow