Speed up text comparisons (feature vectors) with spatial MySQL features

https://stackoverflow.com/questions/1460618

12-09-2019
|

Question

I have a function which takes two arrays containing the tokens/words of two texts and gives out the cosine similarity value which shows the relationship between both texts.

The function takes an array $tokensA (0=>house, 1=>bike, 2=>man) and an array $tokensB (0=>bike, 1=>house, 2=>car) and calculates the similarity which is given back as a floating point value.

function cosineSimilarity($tokensA, $tokensB) {
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();
    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;
    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y;
        $b += $x;
        $c += $y;
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}

If I want to compare 75 texts with each other, I need to make 5,625 single comparisons to have all texts compared with each other.

Is it possible to use MySQL's spatial columns to reduce the number of comparisons?

I don't want to talk about my function or about ways to compare texts. Just about reducing the number of comparisons.

MySQL's spatial columns

You create spatial columns with: CREATE TABLE abc (clmnName TYPE)
possible types are listed here
here is how I select the data later [e.g. MultiPointFromText() or AsText()]
You insert values like this: INSERT INTO clmnName VALUES (GeomFromText('POINT(1 1)'))

But how do you use this for my problem?

PS: I'm looking for ways to reduce the number of comparisons with algorithms in this question. Vinko Vrsalovic told me that I should open another question for the spatial features.

Solution

While R-Trees in general can index data with arbitrary number of dimensions, MySQL spatial abilities are only limited to Geometry types (2 dimensions).

If your vectors are 2-dimensional and you can normalize them, then do the following:

Split the circle into twice the number of angles which fit your differences
Find the MBR of vectors with given cosine difference from the center of each sector
Find all vectors within the MBR
Do the fine filtering for exact difference.

In this case, however, it will be better just to precaculate the angle of the value and index it with a plain B-Tree index.

OTHER TIPS

In fact you have only 75 * 74 / 2 = 2775 comparisons. You compare every word with 74 others, but you don't need to compare word1 with word2 and again word2 with word1. So it gives half of comparisons less.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow