Question

I am importing data on a daily basis from several Excel files whose records have no identifiers. The data is then stored in a relational database (Oracle).

The problem is that the text may differ slightly from source to source, and because there is no unique identifier I have to base my comparison on text values.

For example, say I get this information from different sources:

Source A: The Dark Knight
Source B: Batman The Dark Knight
Source C: The Dark Knight 2008
Source D: The Dark Knight Rises

If the database already holds an item with item_name "The Dark Knight", then importing the lines from sources A, B, and C should give a "Full Match", but not the line from D, because that is a different movie.

Things to know:

  • The process is not 100% automated, so if there is no match, a user will either match the item manually or create a new record.
  • Although there is user interaction, I want to keep it to a minimum (especially after the user has manually matched an item).

How do I solve this without inflating the database with tons of synonyms for each item?


Solution

Update 05/21/2013

I have found this: http://matpalm.com/resemblance/

It uses the Jaccard coefficient, although I'm not sure it is the best fit for my case because of the complexity: matching takes m x n comparisons, where m is the number of imported records and n is the total number of database records, which could be in the tens of thousands.
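
To make the idea concrete, here is a minimal Python sketch of Jaccard matching on word tokens, using the titles from the question. It is only an illustration under my own assumptions: the helper names (tokens, jaccard, build_index, best_matches), the 0.5 threshold, and the extra stored title "Inception" are all hypothetical and not taken from the linked article. The token index is one possible way to avoid the full m x n comparison, since only stored names that share at least one word with the incoming title are scored.

```python
from collections import defaultdict


def tokens(title):
    """Lowercase a title and split it into a set of word tokens."""
    return set(title.lower().split())


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_index(names):
    """Map each token to the set of stored names that contain it."""
    index = defaultdict(set)
    for name in names:
        for tok in tokens(name):
            index[tok].add(name)
    return index


def best_matches(title, index, threshold=0.5):
    """Score only stored names sharing at least one token with the title.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    query = tokens(title)
    candidates = set()
    for tok in query:
        candidates |= index.get(tok, set())
    scored = [(jaccard(query, tokens(name)), name) for name in candidates]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


# Hypothetical stand-in for the item_name values already in the database.
stored = ["The Dark Knight", "Inception"]
index = build_index(stored)

for incoming in ["The Dark Knight", "Batman The Dark Knight",
                 "The Dark Knight 2008", "The Dark Knight Rises"]:
    print(incoming, "->", best_matches(incoming, index))
```

With word-level tokens, source A scores 1.0 against "The Dark Knight", while sources B, C, and D all score 0.75. So the coefficient on its own does not separate "The Dark Knight Rises" from the true matches, and the manual confirmation step (or some additional signal) would still be needed for scores below 1.0.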

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow