Question

I am trying to find duplicates in a list of about 5000 records. Each record is a person's name and address, but everything was typed inconsistently into one field, so I'm trying a fuzzy matching approach. My methodology (using RapidMiner) is to do some pre-processing on the text (tokenization, removing common and irrelevant words such as "Mr", etc.), generate a TF-IDF vector per record, and use DBSCAN to cluster matching records. This works and gives pretty good results, but takes a very long time when I run the full dataset. It also results in a lot of clusters with only one element, and I don't know how this affects DBSCAN's calculation time.
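
For concreteness, here is roughly what I'm doing, sketched in scikit-learn (I actually use RapidMiner; the library choice, the stop-word list and the eps value below are illustrative, not my real setup):

    from sklearn.cluster import DBSCAN
    from sklearn.feature_extraction.text import TfidfVectorizer

    records = [
        "Mr John Smith, 12 High Street, Springfield",
        "John Smith, 12 High St, Springfield",      # likely duplicate
        "Jane Doe, 7 Oak Avenue, Shelbyville",
    ]

    # Tokenize and drop low-information tokens such as honorifics.
    vectorizer = TfidfVectorizer(stop_words=["mr", "mrs", "ms", "dr"])
    tfidf = vectorizer.fit_transform(records)  # sparse, n_records x n_terms

    # Cosine distance is the usual choice for TF-IDF; eps would need tuning.
    labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(tfidf)
    print(labels)  # -1 marks noise; equal non-negative labels mark a cluster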

Is there a clustering algorithm which would work faster for this sort of data, or is there a better way of solving this problem?


Solution

Are you sure that clustering is the best approach?

To me this sounds as if you are doing near-duplicate detection, i.e. you define some threshold and just want to find whether there is any other object within that similarity threshold. If you are using DBSCAN with a low minPts value (say, minPts=2), you aren't really doing DBSCAN.

If DBSCAN produces single-element clusters, it is implemented incorrectly. There cannot be single-element clusters in DBSCAN; those points are noise objects and should be treated as such.
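
To make that concrete, here is how the output should be consumed (a sketch using scikit-learn's labelling convention, where noise points get the label -1; your RapidMiner setup may encode this differently):

    from collections import defaultdict

    def group_duplicates(records, labels):
        """Collect duplicate groups; never report noise as 1-element clusters."""
        clusters = defaultdict(list)
        for record, label in zip(records, labels):
            if label != -1:                # -1 = noise: a record with no near-duplicate
                clusters[label].append(record)
        return list(clusters.values())     # each group has >= minPts members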

I don't know about the quality of the DBSCAN implementation in RapidMiner. If it is based on the Weka code, it is crap (i.e. don't even bother to try Weka!), and really, really slow. At the very least it is spelled incorrectly: "DBScan" is wrong; it is an acronym and should be spelled DBSCAN...

DBSCAN, when implemented naively, has complexity O(n^2). If you have a good index to support the neighborhood queries, it can run in O(n log n). With a really bad implementation, it might degrade to O(n^3) because of inefficient data structures. (From a quick check, RapidMiner should be O(n^2), but may waste a lot of memory due to the use of LinkedList<Integer>, and garbage collection will likely cost quite a bit.)
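
To illustrate the indexing point, here is a sketch with scikit-learn's NearestNeighbors (my choice of tool, not RapidMiner's internals). Euclidean distance on L2-normalized vectors stands in for cosine, because tree indexes don't support cosine directly, and on normalized data both rank neighbors identically:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import normalize

    X = normalize(np.random.rand(5000, 50))  # stand-in for a TF-IDF matrix

    # Same radius queries; only the per-query cost differs:
    brute = NearestNeighbors(radius=0.3, algorithm="brute").fit(X)     # O(n) per query
    tree = NearestNeighbors(radius=0.3, algorithm="ball_tree").fit(X)  # ~O(log n) per query

    _, naive_hits = brute.radius_neighbors(X[:1])
    _, indexed_hits = tree.radius_neighbors(X[:1])  # identical neighborhoods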

5000 objects is actually not that much, so it may well be that the problem is not the clustering algorithm's complexity (I've run DBSCAN on 100,000 objects in 60 seconds on a single CPU with ELKI) but the distance computation. From a quick glance at RapidMiner, I could not see whether it has support for sparse vectors, efficient cosine similarity on sparse vectors, or even the trick of precomputing the vector lengths to fully exploit sparsity (at the cost of preprocessing each object and storing one additional double). Now obviously, if your vectors have a sparsity of 1%, as is common with text vectors, an inefficient non-sparse distance computation will take ~100x as long as necessary, computing lots of 0s.
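
Here is what that precomputation trick looks like, sketched with scipy sparse matrices (again my choice of tools, not RapidMiner's actual code):

    import numpy as np
    import scipy.sparse as sp

    def cosine_with_precomputed_norms(tfidf: sp.csr_matrix) -> sp.csr_matrix:
        # One extra double per object, paid once during preprocessing:
        norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1)).A.ravel()
        # Sparse x sparse product: zero entries are never touched.
        dots = tfidf @ tfidf.T
        # Scale each dot product by the two precomputed vector lengths.
        inv = sp.diags(1.0 / np.maximum(norms, 1e-12))
        return inv @ dots @ inv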

I've just run DBSCAN with ELKI on a text-like dataset. It's not as big as yours: just 3333 observations, 50 dimensions, and a sparsity of 13%. It takes 7 seconds to run DBSCAN with epsilon=0.001, minPts=3 and cosine similarity, without index acceleration enabled. So it clearly seems as if the problem is not DBSCAN itself but the RapidMiner implementation. Note that ELKI is a bit tricky to use with sparse data: you need the data in the right input format, you need to set up a number of filters correctly, etc., so I would not recommend it to beginners right now. This is more to highlight that it is probably the similarity-function implementation in RapidMiner that kills your runtime.

I don't want to discourage you from DBSCAN. From a clustering point of view, it is one of the most appropriate algorithms you can find: it supports arbitrary distance functions (k-means doesn't), you don't need to know the number of clusters beforehand (which you apparently don't), and it will also opt out of clustering everything (as apparently you have a lot of non-cluster objects).

However, I believe you do not want to do clustering at all. You seem to be interested in duplicate detection, and I'd assume you would get much better performance and result quality by, e.g., loading your documents into a text search engine such as Lucene or Xapian, and querying the dedicated text search engine to see whether you have near-duplicates. These text search engines are really good at exactly this: matching text. Much better than RapidMiner, I presume.
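
The mechanism that makes such engines fast is an inverted index: each token points at the records containing it, so candidate duplicates are found without comparing every pair. A hand-rolled sketch of the idea (for illustration only; in practice, query Lucene or Xapian rather than rolling your own):

    from collections import defaultdict

    def build_index(records):
        index = defaultdict(set)
        for i, text in enumerate(records):
            for token in text.lower().split():
                index[token].add(i)
        return index

    def near_duplicate_candidates(query, index, min_shared=2):
        counts = defaultdict(int)
        for token in query.lower().split():
            for doc_id in index.get(token, ()):
                counts[doc_id] += 1
        # Only records sharing several tokens are worth a full comparison.
        return [doc for doc, c in counts.items() if c >= min_shared]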

Other tips

Your problem is not clustering but approximate string matching.

You may want to look into the Levenshtein edit distance, or use one of the several existing implementations.
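
For reference, a minimal Levenshtein edit distance in plain dynamic programming (libraries such as rapidfuzz provide much faster implementations of the same measure):

    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("12 High Street", "12 High St"))  # -> 4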
