NLP - Improving Running Time and Recall of Fuzzy string matching

https://stackoverflow.com/questions/10290626

02-06-2021
|

Question

I have made a working algorithm but the running time is very horrible. Yes, I know from the start that it will be horrible but not that much. For just 200000 records, the program runs for more than an hour.

Basically what I am doing is:

for each searchfield in search fields
    for each sample in samples
        do a q-gram matching
    if there are matches then return it
    else
        split the searchfield into uniwords
        for each sample in samples
            split sample into uniwords
            for each uniword in samples
                if the uniword is a known abbreviation
                    then search the dictionary for its full word or other known abbr
                else do a jaro-winkler matching
            average the distances of all the uniwords
            if the average is above threshold then make it as a match and break
        end for
        if there is a match make a comment that it matched one of the samples partially
    end else
end for

Yes, this code is very loop-happy. I am using brute-force because the recall is very important. So, I'm wondering how can I make it faster since I am not only running it for 200000 data for millions of data and the computers of the client are not high-end (1GB-2GB of Ram Pentium 4 or Dual-Core, the computer where I test this program is a Dual Core with 4GB of Ram). I came across TF/IDF but I do not know if it will be sufficient. And I wonder how can google make searches real time.

Thanks in advance!

Edit: This program is a data filterer. From 200,000 dummy data (actual data is about 12M), I must filter data that is irrelevant to the samples (500 dummy samples, I still do not know how much the actual amount of samples).

With the given dummy data and samples, the running time is about 1 hour but after tinkering here and there, I have successfully lessen it to 10-15 minutes. I have lessen it by grouping the fields and samples that begin with the same character (discounting special and non-meaningful words e.g. the, a, an) and matching the fields to the sample with the same first character. I know there is a problem there. What if the field was misspelled at the first character? But I think the number of those are negligible. The samples are spelled correctly since it is always maintained.

No correct solution

OTHER TIPS

what is your programing language? I guess using q=2 or 3 is sufficient. Also I suggested to come from uni gram to higher degrees.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow