Question

I have a database which has very similar rows within the same table. Those rows are similar because they have nearly equal column values. I need to integrate those corresponding rows into one single row.

For example, those two users (u1 and u2) should be integrated:

 u1 = User(name = "William Henry Gates III",
           age = 55,
           nationality = "american",
           alma_mater = "Harvard Univesity")

 u2 = User(name: "William Henry 'Bill' Gates III",
           age: 55,
           nationality: "America",
           alma_mater: "Harvard U.")

I am thinking of using some edit distance and stemming techniques. Other algorithms and techniques suggestions? Any helpful libraries to use (preferably in Python or Java)?

Was it helpful?

Solution

Considered something like Refine?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top