Question

I have a MySQL database with the following format:

id      string
1        foo1...
2        foo2...
..       ...

There are >100k entries in this db.

What I want to do is for each string, compare it to every other string and store some metric of the comparison. Doing this will essentially yield a 2D matrix of size NxN, where N is the number of rows in the db.

My initial thought was to create another db where each row id corresponds to the same id in the first db and each column holds the value from comparing the two strings. For example, id 1, column 2 in the second db would be the value output by comparing id 1 and id 2 in the first db.

The format of the second db:

id    col1    col2    col3    ....
1       1      0.4     0.5    .....
...    ...     ...      ...

This way of creating the second db would result in 100k rows x 100k columns, which is the issue at hand. What is the best way to handle large data sets like this? Is storing the data in text files more efficient (say, each text file corresponds to one row in the second db)?


Solution

It is usually more practical to compute the distances on the fly rather than storing $N^2$ values. If possible, you will want to use a heuristic to select only the items that could be sufficiently similar, so you don't waste time computing distances to irrelevant items. PostgreSQL (via its pg_trgm extension) can create an index on character trigrams that enables efficient querying for other strings that are sufficiently similar (measured by the Jaccard similarity coefficient of their trigram sets).
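For illustration, here is a minimal Python sketch of the on-the-fly approach, with a trigram inverted index standing in for the candidate-selection heuristic. The threshold of 0.3 and the helper names are illustrative assumptions, not part of any particular library:

```python
from collections import defaultdict

def trigrams(s):
    """Set of character trigrams of a string (padded so short strings still work)."""
    s = f"  {s.lower()} "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(strings):
    """Inverted index: trigram -> ids of strings containing it.
    This plays the role of the trigram index: only strings sharing at
    least one trigram with the query are ever compared."""
    index = defaultdict(set)
    for i, s in enumerate(strings):
        for t in trigrams(s):
            index[t].add(i)
    return index

def similar(query, strings, index, threshold=0.3):
    """Compute the Jaccard similarity of trigram sets on the fly,
    only for candidate strings, and keep those above the threshold."""
    q = trigrams(query)
    candidates = set().union(*(index.get(t, set()) for t in q))
    for i in candidates:
        t = trigrams(strings[i])
        score = len(q & t) / len(q | t)
        if score >= threshold:
            yield i, score
```

Nothing NxN is ever stored here: each query only touches its candidate set, and similarities are used and discarded immediately.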

Storing an $N$-column-wide table is not possible anyway, because MySQL allows at most 4096 columns per table. Even if that were not the case, using an RDBMS in such a way is considered bad practice, and such a table would be difficult to work with.

If for some reason you must store a dense distance matrix, a format such as HDF5 would be more efficient than an RDBMS.
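If you do go that route, a rough sketch with h5py might look like the following. The file name, dataset name, and compute_row helper are illustrative assumptions; also note that a 100k x 100k float32 matrix is on the order of 40 GB on disk before compression:

```python
import h5py

n = 100_000  # number of rows in the first db

with h5py.File("distances.h5", "w") as f:
    # Chunked, compressed dataset: rows are written one at a time,
    # so the full matrix never needs to fit in memory.
    dset = f.create_dataset(
        "dist", shape=(n, n), dtype="float32",
        chunks=(1, n), compression="gzip",
    )
    for i in range(n):
        dset[i, :] = compute_row(i)  # hypothetical helper: distances from item i to all N items

# Later, reading back a single row or cell is cheap:
with h5py.File("distances.h5", "r") as f:
    row = f["dist"][42, :]
    one_value = f["dist"][1, 2]
```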

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange