Question

I would like to create a formula that builds a similarity matrix from a table of data. Below is an example of the data, followed by the desired output. Each row holds the results of a modularity class algorithm for one country. For each pair of countries, the output should be the number of months in which the two share the same class value (any value, as long as the two are equal), divided by the total number of months.

Input modularity class

        jan feb mar apr may
USA     0   1   2   4   3
UK      0   3   2   3   3
AU      0   2   2   2   3
CH      1   0   1   1   2
EG      2   3   0   0   1

Output similarity matrix

      USA   UK    AU    CH    EG
USA   1
UK    0.6   1
AU    0.6   0.6   1
CH    0     0     0     1
EG    0     0.2   0     0     1

I have not tried any formula because I don't understand where to begin. I have read about COUNTIFS and MMULT, but I don't know which is most appropriate.


Solution

There is a way to do this with SUM(), IF() and COUNT() in an array formula.

The basic formula that you could use is (assuming your input table is in A1:F6, including the headers, so the country names are in A2:A6 and the class values are in B2:F6):

{=SUM(IF($B$2:$F$2=$B2:$F2,1,0))/COUNT($B$2:$F$2)}

In an array formula, IF() is evaluated for each pair of cells in the two rows, returning 1 for a match and 0 for a mismatch. SUM() adds up the matches, and dividing by the COUNT() of the cells compared (the number of months) gives you your similarity index.
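To check the arithmetic outside the spreadsheet, here is a minimal Python sketch of the same IF/SUM/COUNT steps for one pair of rows. The function name `similarity` is just an illustrative choice, and the two lists are copied from the USA and UK rows of the question.

```python
# Class values per month (jan..may), copied from the question's input table.
usa = [0, 1, 2, 4, 3]
uk = [0, 3, 2, 3, 3]


def similarity(row_a, row_b):
    """Fraction of months in which two countries share the same class value."""
    # Mirrors IF(a=b,1,0): one 0/1 flag per month.
    matches = [1 if a == b else 0 for a, b in zip(row_a, row_b)]
    # Mirrors SUM(...)/COUNT(...): matches divided by months compared.
    return sum(matches) / len(row_a)


print(similarity(usa, uk))  # 0.6 (jan, mar and may match)
```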

The example above is for the top USA/USA cell in your example, and you can fill down from it. Each new column, starting at its diagonal cell, needs the fixed row number changed to that country's row. So the UK/UK cell at the top of the UK column would be:

{=SUM(IF($B$3:$F$3=$B3:$F3,1,0))/COUNT($B$3:$F$3)}

Conditional sum using array formulas

The COUNT() could be replaced by a constant if you know the number of months (data columns) in advance.

Note: You do not type the curly braces. When you have finished entering the formula you have to press Ctrl-Shift-Enter whilst editing, to get those to appear and for the formula to be treated as an array formula. These formulas are often called CSE formulas for this reason (Ctrl-Shift-Enter).

Update:

You can do it with a single formula, filled over the cells, using INDIRECT() and COLUMN() as well.

{=SUM(IF(INDIRECT("$B$"&COLUMN(B2)):INDIRECT("$F$"&COLUMN(B2))=$B2:$F2,1,0))/COUNT(INDIRECT("$B$"&COLUMN(B2)):INDIRECT("$F$"&COLUMN(B2)))}

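This relies on the layout: COLUMN(B2) returns 2, which is also the row holding the USA data, and filling right gives COLUMN(C2) = 3, the row holding the UK data. The column number therefore selects the matching input row, which provides the transposition.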

Update 2:

Actually, the COUNT() can be eliminated and SUM() replaced with AVERAGE(): because each match is 1 and each mismatch is 0, the mean of the flags is exactly the fraction of matching months.

So this works for all cells:

{=AVERAGE(IF(INDIRECT("$B$"&COLUMN(B2)):INDIRECT("$F$"&COLUMN(B2))=$B2:$F2,1,0))}
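As a cross-check of the filled block, the sketch below computes every pair at once in Python by averaging the 0/1 match flags, the same idea as AVERAGE(IF(...)). The names `classes` and `similarity_matrix` are hypothetical; the data is copied from the question.

```python
from statistics import mean

# Input table from the question: country -> class value for jan..may.
classes = {
    "USA": [0, 1, 2, 4, 3],
    "UK": [0, 3, 2, 3, 3],
    "AU": [0, 2, 2, 2, 3],
    "CH": [1, 0, 1, 1, 2],
    "EG": [2, 3, 0, 0, 1],
}


def similarity_matrix(table):
    """Return {row: {col: share of months with equal class values}}."""
    names = list(table)
    matrix = {}
    for row in names:
        matrix[row] = {}
        for col in names:
            # 1 for a match, 0 for a mismatch; the mean is the similarity.
            flags = [int(a == b) for a, b in zip(table[row], table[col])]
            matrix[row][col] = mean(flags)
    return matrix


for name, row in similarity_matrix(classes).items():
    print(name, [row[col] for col in classes])
```

The result is symmetric, so the lower triangle carries all the information, which matches the layout asked for in the question.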

Update 3:

If you need the cells above the diagonal to stay empty, giving a lower triangular block, you can wrap the formula above in an IF() that checks whether the column number is greater than the row number and returns an empty string in that case. You can then fill the entire block with the formula and it will look right, with no need to delete anything.

{=IF(COLUMN(B2)>ROW(B2),"",AVERAGE(IF(INDIRECT("$B$"&COLUMN(B2)):INDIRECT("$F$"&COLUMN(B2))=$B2:$F2,1,0)))}
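The same column-greater-than-row test can be sketched in Python, using list positions in place of COLUMN() and ROW(). This is only an illustration under the question's data; `lower_triangle` is a made-up name.

```python
# Input table from the question: country -> class value for jan..may.
classes = {
    "USA": [0, 1, 2, 4, 3],
    "UK": [0, 3, 2, 3, 3],
    "AU": [0, 2, 2, 2, 3],
    "CH": [1, 0, 1, 1, 2],
    "EG": [2, 3, 0, 0, 1],
}


def lower_triangle(table):
    """Similarity rows with "" in every cell above the diagonal."""
    names = list(table)
    rows = []
    for r, row_name in enumerate(names):
        cells = []
        for c, col_name in enumerate(names):
            if c > r:
                # Same role as IF(COLUMN(...)>ROW(...),"",...).
                cells.append("")
            else:
                pairs = zip(table[row_name], table[col_name])
                flags = [int(a == b) for a, b in pairs]
                cells.append(sum(flags) / len(flags))
        rows.append(cells)
    return rows


for name, cells in zip(classes, lower_triangle(classes)):
    print(name, cells)
```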

OTHER TIPS

I'm going to provide a nudge in the right direction, rather than the whole solution. I think you'll be able to generalize the solution from the formulas given.

I placed your input matrix in A1:F6. Then I set up an intermediate matrix. A9 is simply "Matches in column 1."

===edit: Using month names in columns ===

Then I copied the month names jan .. may in B10:F10. Also USA .. EG in A11 .. A15.

Then in B12, the fun starts. I entered the formula =IF(B3=B2,1,0), which places a 1 in the cell if B3=B2 and a 0 otherwise. Filling right from B12 to F12 gives the following row:

1   0   1   0   1

You will note that these are the months when UK = USA.

If you take B12 and fill down to B15 and then fill B12 .. B15 right to F15, you will get the matches between each country and the country in the row above it (both references are relative, so they shift down together). My matrix turned out to be:

    jan feb mar apr may
USA                 
UK  1   0   1   0   1
AU  1   0   1   0   1
CH  0   0   0   0   0
EG  0   0   0   0   0

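If you divide the sum of each row by the number of months (5), you will get 3/5, 3/5, 0/5, 0/5. For this data these equal the first column of your output matrix, 0.6, 0.6, 0, 0. To compare every row against USA rather than against the row above it, anchor the reference as =IF(B3=B$2,1,0).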
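Here is a small Python sketch of this intermediate-matrix idea, assuming every country is compared against USA (not against the row above it). The names `flags` and `first_column` are illustrative only.

```python
# Input table from the question: country -> class value for jan..may.
classes = {
    "USA": [0, 1, 2, 4, 3],
    "UK": [0, 3, 2, 3, 3],
    "AU": [0, 2, 2, 2, 3],
    "CH": [1, 0, 1, 1, 2],
    "EG": [2, 3, 0, 0, 1],
}

reference = classes["USA"]

# Intermediate matrix: one 0/1 flag per month, 1 where the country matches USA.
flags = {
    name: [int(value == ref) for value, ref in zip(values, reference)]
    for name, values in classes.items()
    if name != "USA"
}

# Row sum divided by the number of months gives the USA column of the output.
first_column = {name: sum(row) / len(row) for name, row in flags.items()}

for name in flags:
    print(name, flags[name], first_column[name])
```

Repeating this with each country in turn as the reference would fill the remaining columns.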

Licensed under: CC-BY-SA with attribution