Pergunta

Let's assume we have a table of People (name, surname, address, SSN, etc).

We want to find all rows that are "very similar" to specified person A. I would like to implement some kind of fuzzy logic comparation of A and all rows from table People. There will be several fuzzy inference rules working separately on several columns (e.g. 3 fuzzy rules for name, 2 rules on surname, 5 rules on address)

The question is Which of the following 2 approaches would be better and why?

  1. Implement all fuzzy rules as stored procedures and use one heavy SELECT statement to return all rows that are "very similar" to A. This approach may include using soundex, sim metric etc.

  2. Implement one or more simplier SELECT statements, that returns less accurate results, "rather similar" to A, and then fuzzy-compare A with all returned rows (outside database) to get "very similar" rows. So fuzzy comparation would be implemented in my favorit programming language.

Table People should have up to 500k rows, and I would like to make about 500-1000 queries like this a day. I use MySQL (but this is yet to be considered).

Foi útil?

Solução

I don't really think there is a definitive answer because it depends on information not available in the question. Anyway, too long for a comment.

DBMSes are good at retrieving information according to indexes. It does not make sense to have a db server wasting time in heavy computations unless it is dedicated for this specific purpose (as answered by @Adrian).

Therefore, your client application should delegate to the DBMS the retrieval of information required by the rules.

If the computations are minor, all could be done on the server. Else, pull it off into the client system.

The disadvantage of the second approach lies in the amount of data traveling from the server to the client and the number of connections to establish. So, typically it is a compromise between computation and data transfer in the server. A balance to be achieved depending on the specificities of the fuzzy rules.

Edit: I've seen in a comment that you are almost sure to have to implement the code in the client. In that case, you should consider an additional criterion, code locality, for maintenance purposes, i.e., try to have all code that is related together, not spreading it between systems (and languages).

Outras dicas

I would say you're best off using simple selects to get the closest matches you can without hammering the database, then do the heavy lifting in your application layer. The reason I would suggest this solution is scalability: if you do your heavy lifting in the application layer, your problem is a perfect use case for a map-reduce-style solution wherein you can distribute the processing of similarities across nodes and get your results back much faster than if you put it through the database; plus, this way, you're not locking up your database and slowing down any other operations that may be going on at the same time.

Since you're still considering what DB to use PostgreSQL has fuzzystrmatch module which provides Levenshtein and Soundex functions. Also, you might want to look on the pg_trm module as described here. Maybe you could also put the index on the column using soundex() so you won't have to calculate that every time. But you seem to optimize prematurely so my advice would be to test using pg and then wonder if you need to optimize or not, the numbers you provided really don't seem like a lot considered you almost have two minutes to run one query.

An option i'd consider is to add a column in the "People Talbe" that is the SoundEx value of the person.

I've done joins using

Select [Column}
From People P 
    Inner join TableA A  on Soundex(A.ComarisonColumn) = P.SoundexColumn

That'll return anything in TableA that has the same SoundEx value from the People Tables SoundEx Column.

I haven't used that kind of query on tables that size, but i see no issues with trying it. You can also index that SoundExColumn to help with performance.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top