Question

I have a database table that contains near-duplicate rows: rows whose column values are nearly, but not exactly, equal. I need to merge each group of corresponding rows into a single row.

For example, these two users (u1 and u2) should be merged:

 u1 = User(name = "William Henry Gates III",
           age = 55,
           nationality = "american",
           alma_mater = "Harvard Univesity")

 u2 = User(name = "William Henry 'Bill' Gates III",
           age = 55,
           nationality = "America",
           alma_mater = "Harvard U.")

I am thinking of using edit distance and stemming techniques. Are there other algorithms or techniques you would suggest? Are there any helpful libraries for this (preferably in Python or Java)?
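To make the edit-distance idea concrete, here is a minimal sketch that scores the two records above field by field. It uses `difflib.SequenceMatcher` from the Python standard library, whose ratio is a matching-blocks similarity and not a true Levenshtein distance. The dictionary layout, the unweighted mean, the extra record `u3`, and the `THRESHOLD` value are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(str(value).lower().split())

def field_similarity(a, b):
    """Similarity in [0, 1] between two field values (difflib ratio)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def record_similarity(r1, r2, fields):
    """Unweighted mean of the per-field similarities (an assumed scoring rule)."""
    return sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)

FIELDS = ["name", "age", "nationality", "alma_mater"]
THRESHOLD = 0.7  # hypothetical cut-off; tune it on real data

u1 = {"name": "William Henry Gates III", "age": 55,
      "nationality": "american", "alma_mater": "Harvard Univesity"}
u2 = {"name": "William Henry 'Bill' Gates III", "age": 55,
      "nationality": "America", "alma_mater": "Harvard U."}
# u3 is an invented, unrelated record used only as a negative example.
u3 = {"name": "Steven Paul Jobs", "age": 56,
      "nationality": "american", "alma_mater": "Reed College"}

score = record_similarity(u1, u2, FIELDS)
is_duplicate = score >= THRESHOLD
```

Comparing every pair of rows is quadratic in the table size, so on a large table it would probably be necessary to compare only rows that already share some cheap key, such as the first letters of the name.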

Solution

Have you considered something like Refine (Google Refine, now OpenRefine)?
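For a feel of what such a tool does, here is a rough sketch in the spirit of the "key collision" clustering that Refine offers: each value is reduced to a normalized fingerprint, and values that share a fingerprint are grouped together. This is an approximation written for illustration, not Refine's actual code; the function names and sample values are assumptions.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Reduce a string to a key that ignores case, accents, punctuation and word order."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")   # drop accents
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))                      # dedupe and sort words
    return " ".join(tokens)

def cluster(values):
    """Group values with identical fingerprints; keep only groups of two or more."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]

# Invented sample values for illustration.
names = ["Harvard University", "university, harvard",
         "Harvard  University.", "Harvard U."]
clusters = cluster(names)
```

Note that "Harvard U." does not land in the same group, because abbreviations change the fingerprint. Catching those cases would still need a distance-based comparison like the one discussed in the question.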

Licensed under: CC-BY-SA with attribution