Setting up a 4-D contingency table in R from a flat, 5-column data.frame with 3 factors

StackOverflow https://stackoverflow.com/questions/17558937

  •  02-06-2022

Question

I have a data frame in R. The first two columns are my summed frequencies of "Yes" and "No." The final 3 columns are categorical factors, each with a label.

I am trying to make a 4-D contingency table from this format and I have no idea where to start the process.

My data looks like this:

    Sold    Unsold  Label1                   Label2                 Label3
1   3330    32102   AdvancedShopper: Y       TERR_USED: Non-TREE    SPINOFF: N
2   2735    30691   HSEHLD_INCDT_BAND: 0     CLM_FREE_INCDT_CT: 0   SPINOFF: N
3   3350    29485   TERR_USED: Non-TREE      CLM_FREE_INCDT_CT: 0   SPINOFF: N
4   3864    28657   SingleMulti: N           TERR_USED: Non-TREE    SPINOFF: N
5   2691    26355   TERR_USED: Non-TREE      HSEHLD_INCDT_BAND: 0   CLM_FREE_INCDT_CT: 0
6   2396    25884   TERR_USED: Non-TREE      HSEHLD_INCDT_BAND: 0   SPINOFF: N
7   2738    25172   Channel: Owned Agency    TERR_USED: Non-TREE    SPINOFF: N
8   3203    24425   TERR_USED: Non-TREE      FULL_CVG_FLG: Y        SPINOFF: N
9   2781    24163   SingleMulti: N           CLM_FREE_INCDT_CT: 0   SPINOFF: N
10  1950    22371   AdvancedShopper: Y       CLM_FREE_INCDT_CT: 0   SPINOFF: N
11  2644    21528   TERR_USED: Non-TREE      FULL_CVG_FLG: N        SPINOFF: N
12  2278    21736   Channel: Owned Agency    SingleMulti: N         SPINOFF: N
13  2324    21648   SingleMulti: N           HSEHLD_INCDT_BAND: 0   CLM_FREE_INCDT_CT: 0
14  3108    20780   Channel: Prudent         TERR_USED: Non-TREE    SPINOFF: N
15  2491    21216   TERR_USED: Non-TREE      PRIOR_BI: High         SPINOFF: N

I began with 8 columns: 3 categories, 3 values (one per category), the number of quotes written, and the number of sales on those quotes. I concatenated each category with its value to form the three label columns above. I have 19 categories, each with between 2 and 6 attributes.

Sorting puts the respective columns in order, but it does not necessarily form the 4-D boxes for each combination of 3 categories and the respective Yes (Sold) and No (Unsold). The mean rate of sales is 11.4%, and I would like to get the frequencies into shape to run chi-squared tests on these four-way contingencies, to identify the combinations that deviate most strongly from the mean.

I have 80046 combinations: essentially (19 choose 3) category triples, each with its respective attribute buckets. For example, Row 1 is from a 4-D table of 16 cells (2 attr x 2 attr x 2 attr x [Y,N]), Row 2 is from a 4-D table of 96 cells (4 attr x 6 attr x 2 attr x [Y,N]), etc.

I'm unsure how to get this data into a format where I can start using the table() and xtabs() functions, and then chisq.test(). (Should I go back to the step before I concatenated the categories and values?)

I am new to R, but I know it's supposed to be much better at programming for these large arrays. I don't have access to SPSS, but I do have access to SAS (I am also new to that) if there's something easier to try there.

Any sort of direction is a big help.

------------------- Desired output? reply ---------------------

Well, the table command takes a data.frame from

Category 1       Category 2       Category 3       Y/N

...into contingency table format, right? But I already have my Yes's and No's in a frequency format with the three categories listed as such.

Do I need to change to this single instance format and explode my 80046 row table into millions of rows? Or is there a way to initiate the table command with the frequencies of Yes and No already tabulated in two columns?


Solution

In that case, you can create a variable that gives the percentage of Yes over Yes + No. See whether this works for you (assuming your data frame is named sample):

mytab <- xtabs((100*Sold/(Sold+Unsold))~Label1+Label2+Label3, data=sample)

  > mytab
, , Label3 = CLM_FREE_INCDT_CT: 0

                       Label2
Label1                  CLM_FREE_INCDT_CT: 0 FULL_CVG_FLG: N FULL_CVG_FLG: Y HSEHLD_INCDT_BAND: 0 PRIOR_BI: High SingleMulti: N
  AdvancedShopper: Y                0.000000        0.000000        0.000000             0.000000       0.000000       0.000000
  Channel: Owned Agency             0.000000        0.000000        0.000000             0.000000       0.000000       0.000000
  Channel: Prudent                  0.000000        0.000000        0.000000             0.000000       0.000000       0.000000
  HSEHLD_INCDT_BAND: 0              0.000000        0.000000        0.000000             0.000000       0.000000       0.000000
  SingleMulti: N                    0.000000        0.000000        0.000000             9.694644       0.000000       0.000000
  TERR_USED: Non-TREE               0.000000        0.000000        0.000000             9.264615       0.000000       0.000000
                       Label2
Label1                  TERR_USED: Non-TREE
  AdvancedShopper: Y               0.000000
  Channel: Owned Agency            0.000000
  Channel: Prudent                 0.000000
  HSEHLD_INCDT_BAND: 0             0.000000
  SingleMulti: N                   0.000000
  TERR_USED: Non-TREE              0.000000

, , Label3 = SPINOFF: N

                       Label2
Label1                  CLM_FREE_INCDT_CT: 0 FULL_CVG_FLG: N FULL_CVG_FLG: Y HSEHLD_INCDT_BAND: 0 PRIOR_BI: High SingleMulti: N
  AdvancedShopper: Y                8.017762        0.000000        0.000000             0.000000       0.000000       0.000000
  Channel: Owned Agency             0.000000        0.000000        0.000000             0.000000       0.000000       9.486133
  Channel: Prudent                  0.000000        0.000000        0.000000             0.000000       0.000000       0.000000
  HSEHLD_INCDT_BAND: 0              8.182253        0.000000        0.000000             0.000000       0.000000       0.000000
  SingleMulti: N                   10.321407        0.000000        0.000000             0.000000       0.000000       0.000000
  TERR_USED: Non-TREE              10.202528       10.938276       11.593311             8.472419      10.507445       0.000000
                       Label2
Label1                  TERR_USED: Non-TREE
  AdvancedShopper: Y               9.398284
  Channel: Owned Agency            9.810104
  Channel: Prudent                13.010717
  HSEHLD_INCDT_BAND: 0             0.000000
  SingleMulti: N                  11.881553
  TERR_USED: Non-TREE              0.000000
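
One thing to keep in mind when reading the table above: xtabs() fills every label combination that has no row in the data with 0, so a 0.000000 cell means "combination not present" rather than "0% sold". Below is a minimal sketch of one way to mark such cells as NA. The data frame `quotes` is a hypothetical three-row excerpt of the question's data, and `n_rows` and `pct_na` are names introduced only for illustration.

```r
# Hypothetical three-row excerpt of the question's data (`quotes` is an assumed name).
quotes <- data.frame(
  Sold   = c(3330, 3864, 2324),
  Unsold = c(32102, 28657, 21648),
  Label1 = c("AdvancedShopper: Y", "SingleMulti: N", "SingleMulti: N"),
  Label2 = c("TERR_USED: Non-TREE", "TERR_USED: Non-TREE", "HSEHLD_INCDT_BAND: 0"),
  Label3 = c("SPINOFF: N", "SPINOFF: N", "CLM_FREE_INCDT_CT: 0"),
  stringsAsFactors = TRUE
)

# Percentage sold per label combination, built the same way as in the answer.
pct <- xtabs((100 * Sold / (Sold + Unsold)) ~ Label1 + Label2 + Label3,
             data = quotes)

# Number of data rows behind each cell; 0 means the combination never occurs.
n_rows <- xtabs(~ Label1 + Label2 + Label3, data = quotes)

# Blank out cells with no underlying row so they are not read as "0% sold".
pct_na <- pct
pct_na[as.vector(n_rows) == 0] <- NA

# Count the cells that are actually backed by data.
sum(!is.na(pct_na))
```

With the full data, the same two xtabs() calls should apply unchanged, only with `quotes` replaced by the real data frame.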

> summary(mytab)
Call: xtabs(formula = (100 * Sold/(Sold + Unsold)) ~ Label1 + Label2 + 
    Label3, data = l)
Number of cases in table: 150.7815 
Number of factors: 3 
Test for independence of all factors:
    Chisq = 412.2, df = 71, p-value = 1.48e-49
    Chi-squared approximation may be incorrect
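
The percentage table drops the underlying counts, which a chi-squared test would normally need. As a sketch of an alternative, xtabs() also accepts a matrix of counts on the left-hand side of the formula, in which case the matrix columns become the levels of an extra dimension. Using a hypothetical three-row excerpt of the question's data (the name `quotes` is assumed), this gives a 4-D table of Sold/Unsold counts without expanding the data to one row per quote. One cell is then compared with the 11.4% mean rate from the question. Treating that comparison as a goodness-of-fit test is my assumption about what "outlier from the mean" should mean here, not something the answer above states.

```r
# Hypothetical three-row excerpt of the question's data (`quotes` is an assumed name).
quotes <- data.frame(
  Sold   = c(3330, 3864, 2324),
  Unsold = c(32102, 28657, 21648),
  Label1 = c("AdvancedShopper: Y", "SingleMulti: N", "SingleMulti: N"),
  Label2 = c("TERR_USED: Non-TREE", "TERR_USED: Non-TREE", "HSEHLD_INCDT_BAND: 0"),
  Label3 = c("SPINOFF: N", "SPINOFF: N", "CLM_FREE_INCDT_CT: 0"),
  stringsAsFactors = TRUE
)

# A two-column matrix on the left-hand side makes xtabs() add a fourth
# dimension whose levels are the column names ("Sold", "Unsold").
counts4d <- xtabs(cbind(Sold, Unsold) ~ Label1 + Label2 + Label3, data = quotes)
dim(counts4d)

# Pull the Sold/Unsold counts for one label combination.
cell <- counts4d["AdvancedShopper: Y", "TERR_USED: Non-TREE", "SPINOFF: N", ]

# Goodness-of-fit test of that cell against the overall sales rate
# quoted in the question (11.4%); this choice of test is an assumption.
overall_rate <- 0.114
fit <- chisq.test(as.vector(cell), p = c(overall_rate, 1 - overall_rate))
fit
```

Unlike the percentages, these counts keep the sample size of each combination, so large and small groups are not weighted equally by the test.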
Licensed under: CC-BY-SA with attribution