Creating a table showing the number of common values of two intersecting variables

StackOverflow https://stackoverflow.com/questions/22643216

  •  21-06-2023

Question

I want to make a table such that rows and columns are my variables and each cell shows the number of common values of two intersecting variables.

For example I have the following variables with the values.

 x    y    z
---  ---  ---
 *    b    #
 g    #    i
 #    *    l
 +    k    
      m    

Note that * and # are common in some variables. There is only one common value (i.e. #) between x and z, so the cell (x,z) will be one. The full table will look like the following.

    x    y    z
   ---  ---  ---
x | 4    2    1
y | 2    5    1
z | 1    1    3

How can I do this in SPSS, and what keywords describe this kind of problem?


Solution

Here is a simple Python solution. It creates a pivot table with the counts.

DATA LIST FREE / x (A1) y (A5) z (A8).
BEGIN DATA
 *    b    #
 g    #    i
 #    *    l
 +    k    ""
 ""   m    ""
END DATA.

begin program.
import spss, spssaux, spssdata

# Read every case; each row holds one value per variable.
alldata = spssdata.Spssdata(names=False).fetchall()
nvars = len(alldata[0])
ncases = len(alldata)
vnames = [spss.GetVariableName(i) for i in range(nvars)]
empty = set([''])
# Build the set of distinct non-blank values for each variable.
varsets = {}
for v in range(nvars):
    varsets[v] = set([alldata[i][v].strip() for i in range(ncases)]).difference(empty)
# Count the values shared by every pair of variables.
rows = []
for v1 in range(nvars):
    counts = []
    for v2 in range(nvars):
        counts.append(len(varsets[v1].intersection(varsets[v2])))
    rows.append(counts)
spss.StartProcedure("ValuesInCommon")
table = spss.BasePivotTable("Values in Common", "COMMON")
table.SetDefaultFormatSpec(spss.FormatSpec.Count)
table.SimplePivotTable(rowdim = "Variable",
    rowlabels=vnames,
    coldim="Variables ",
    collabels=vnames,
    cells=rows)
spss.EndProcedure()
end program.
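
The counting logic in the program above can also be checked outside SPSS. The following is only a minimal sketch in plain Python: the `columns` dictionary and the `common_value_counts` helper are hypothetical stand-ins for the SPSS dataset and are not part of the original answer.

```python
# Hypothetical stand-in for the SPSS dataset: one list of values per variable.
# Empty strings play the role of blank cells.
columns = {
    "x": ["*", "g", "#", "+", ""],
    "y": ["b", "#", "*", "k", "m"],
    "z": ["#", "i", "l", "", ""],
}


def common_value_counts(columns):
    """Return the variable names and a square table of shared-value counts."""
    names = list(columns)
    # Distinct non-blank values per variable.
    sets = {n: {v.strip() for v in columns[n]} - {""} for n in names}
    # Cell (a, b) is the size of the intersection of the two value sets.
    table = [[len(sets[a] & sets[b]) for b in names] for a in names]
    return names, table


names, table = common_value_counts(columns)
for name, row in zip(names, table):
    print(name, row)
```

The diagonal holds the number of distinct values in each variable, and the table is symmetric because set intersection does not depend on order.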

Other tips

There is also a solution using the SPSS macro language.

Test data

*** Set the work directory.
cd "C:\temp".


*** Define the list of variable names.
* You can make it using Utilities / Variables... command.
define !varl() x y z !end.


*** Generate the test data.
data list free
 /x (a1) y (a1) z (a1).
begin data
* b #
g # i
# * l
+ k ""
"" m ""
end data.

save out "data.sav".


*** Open each variable in a new data file, drop blanks, sort and save.
define !get_sort_save(x = !cmdend)
!do !var !in (!x)

get "data.sav"
 /keep !var.

sel if !var <> "".

sort cases by !var.

save out !quote(!con(!var, ".sav")).

!doend
!end.


* Run.
!get_sort_save x = !varl.



*** Match each variable with all other variables and count the matched cases.
define !match_count(x = !cmdend)

!do !var1 !in (!x)

get !quote(!con(!var1, ".sav"))
 /ren !var1 = id.

!do !var2 !in (!x)

match files
 /file = *
 /table = !quote(!con(!var2, ".sav"))
 /ren !var2 = id
 /in !var2
 /by id.

!doend

agg out *
 /!x = sum(!x).

form !x (f8).

save out !quote(!con(!var1, "_n.sav")).

!doend
!end.


* Run.
!match_count x = !varl.


*** Add the counts.
define !add(x = !cmdend)
add files
!do !var !in (!x)
 /file = !quote(!con(!var, "_n.sav"))
!doend.
!end.


* Run.
!add x = !varl.

list.

The result is

       x        y        z

       4        2        1
       2        5        1
       1        1        3

Number of cases read:  3    Number of cases listed:  3

Real data

This is the code to run on real data. The data must be saved in a data file. There are three places in the code where you have to make changes; they are marked TASK1, TASK2 and TASK3.

*** TASK1 - set the work directory - the directory where the data file is saved.
cd "C:\temp".


*** TASK2 -  Define the list of variable names.
define !varl() x y z !end.


*** Open each variable in a new data file, drop blanks, sort and save.
define !get_sort_save(x = !cmdend)
!do !var !in (!x)

*** TASK3 - change the file name.
get "data.sav"
 /keep !var.

sel if !var <> "".

sort cases by !var.

save out !quote(!con(!var, ".sav")).

!doend
!end.


* Run.
!get_sort_save x = !varl.



*** Match each variable with all other variables and count the matched cases.
define !match_count(x = !cmdend)

!do !var1 !in (!x)

get !quote(!con(!var1, ".sav"))
 /ren !var1 = id.

!do !var2 !in (!x)

match files
 /file = *
 /table = !quote(!con(!var2, ".sav"))
 /ren !var2 = id
 /in !var2
 /by id.

!doend

agg out *
 /!x = sum(!x).

form !x (f8).

save out !quote(!con(!var1, "_n.sav")).

!doend
!end.


* Run.
!match_count x = !varl.


*** Add the counts.
define !add(x = !cmdend)
add files
!do !var !in (!x)
 /file = !quote(!con(!var, "_n.sav"))
!doend.
!end.


* Run.
!add x = !varl.

list.

This was a bit of a challenge, so here is my attempt to code it up in a reasonable format. First, let's make a dataset that looks like yours.

DATA LIST FREE / x y z (3A1).
BEGIN DATA
 *    b    #
 g    #    i
 #    *    l
 +    k    ""
 ""   m    ""
END DATA.

Next I build one consistent list of symbols and, for each original variable, a dummy variable that flags whether the symbol occurs in that variable.

VARSTOCASES /MAKE V from x TO z /INDEX OrigVar (V).
SORT CASES BY V OrigVar.
CASESTOVARS /ID = V /VIND ROOT = "D" /INDEX = OrigVar. 

You will see that the data now looks like below:

V Dx Dy Dz 
- -- -- --
*  1  1  0 
#  1  1  1 
+  1  0  0 
b  0  1  0 
g  1  0  0 
i  0  0  1 
k  0  1  0 
l  0  0  1 
m  0  1  0
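
For readers who want to see the reshaping step outside SPSS, here is a rough sketch in plain Python that builds the same kind of indicator table. The `columns` dictionary is a hypothetical copy of the example data, and the row order follows Python's string sorting, which may differ from the order SPSS produces.

```python
# Hypothetical copy of the example data; empty strings are blank cells.
columns = {
    "x": ["*", "g", "#", "+", ""],
    "y": ["b", "#", "*", "k", "m"],
    "z": ["#", "i", "l", "", ""],
}

names = list(columns)

# One row per distinct non-blank symbol found in any variable.
symbols = sorted({v for values in columns.values() for v in values if v})

# For each symbol, a 0/1 flag per variable: 1 if the symbol occurs there.
indicators = {s: [int(s in columns[n]) for n in names] for s in symbols}

print("V ", " ".join("D" + n for n in names))
for s in symbols:
    print(s, " ", "  ".join(str(flag) for flag in indicators[s]))
```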

If you multiply Dx by Dy and sum the resulting column, you get the size of the intersection of the two sets, because the product is 1 only for symbols that occur in both variables. Here I define a macro to make it easier to compute all of those products over a list. (Unfortunately you cannot use the TO convention here; for your application you will need to list all 25 variables in the macro call.)

DEFINE !PairInter (!POSITIONAL = !CMDEND).
!DO !I !IN (!1)
!DO !J !IN (!1)
  COMPUTE !CONCAT(!I,"_",!J) = !I*!J.
  FORMATS !CONCAT(!I,"_",!J) (F3.0).
!DOEND
!DOEND
!ENDDEFINE.

!PairInter Dx Dy Dz.

You now have a list of variables Dx_Dx Dx_Dy Dx_Dz Dy_Dx ..... Dz_Dz containing the full set of pairwise products of those variables. I have intentionally kept the redundant pairs because they make building the table easier later on (although when displaying the table I might suggest showing only the lower half).

So now if we sum over the columns we will have the cardinality of each set along with its intersection. Here I use LAG and just keep the final value in the dataset.

DO REPEAT D = Dx_Dx TO Dz_Dz.
  IF ($casenum<>1) D = LAG(D) + D.
END REPEAT.
COMPUTE Order = $casenum.
SORT CASES BY Order (D).
SELECT IF ($casenum = 1).
MATCH FILES FILE = * /DROP V TO Dz Order.
EXECUTE.
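
The arithmetic behind the product-and-sum steps can be verified with a small sketch in plain Python. The `rows` list is copied from the indicator table shown earlier, and the `pairwise_sums` helper is a hypothetical name used only for this illustration.

```python
# Indicator rows copied from the table above: (symbol, Dx, Dy, Dz).
rows = [
    ("*", 1, 1, 0),
    ("#", 1, 1, 1),
    ("+", 1, 0, 0),
    ("b", 0, 1, 0),
    ("g", 1, 0, 0),
    ("i", 0, 0, 1),
    ("k", 0, 1, 0),
    ("l", 0, 0, 1),
    ("m", 0, 1, 0),
]


def pairwise_sums(rows, n):
    """Sum the product of every pair of indicator columns over all rows."""
    totals = [[0] * n for _ in range(n)]
    for _symbol, *flags in rows:
        for i in range(n):
            for j in range(n):
                # The product is 1 only when the symbol is in both variables.
                totals[i][j] += flags[i] * flags[j]
    return totals


totals = pairwise_sums(rows, 3)
for row in totals:
    print(row)
```

Each diagonal entry is a column sum of a squared 0/1 flag, so it equals the number of distinct values in that variable.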

Now you can write a MATRIX procedure to reshape the dataset and print out the table in a nicer format. Here I FLIP the dataset and then grab the original variable names.

STRING I (A1).
COMPUTE I = "I".
FLIP /NEWNAMES = I.
RENAME VARIABLES (CASE_LBL = V).
COMPUTE V = CHAR.SUBSTR(V,LENGTH(V)).
EXECUTE.

MATRIX.
GET I /FILE = * /VARIABLE = I.
GET V /FILE = * /VARIABLE = V.
COMPUTE I2 = RESHAPE(I,3,3).
COMPUTE V2 = V(1:3).
PRINT I2 /RNAMES =V2 /CNAMES = V2.
END MATRIX.

The PRINT statement in the MATRIX procedure then displays the table of intersections you wanted.

Run MATRIX procedure: 

I2 
   x  y  z 
x  4  2  1 
y  2  5  1 
z  1  1  3 

------ END MATRIX -----
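
As an illustration of the reshaping step only, the sketch below turns a flat list of nine sums into a labelled 3 x 3 table in plain Python. The `flat` list is typed in from the result above, in the order Dx_Dx, Dx_Dy, ..., Dz_Dz, and filling row by row is an assumption of this sketch (the table is symmetric, so the fill order does not change the result here).

```python
# The nine sums in the order Dx_Dx, Dx_Dy, Dx_Dz, Dy_Dx, ..., Dz_Dz.
flat = [4, 2, 1, 2, 5, 1, 1, 1, 3]
labels = ["x", "y", "z"]
n = len(labels)

# Cut the flat list into n consecutive rows of n values each.
matrix = [flat[r * n:(r + 1) * n] for r in range(n)]

# Print with row and column labels.
print("  ", "  ".join(labels))
for label, row in zip(labels, matrix):
    print(label, " ", "  ".join(str(value) for value in row))
```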

I've made this into a macro, available here. After defining the macro you can simply run

!InterSet x y z.

and it will print the table.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow