Question

I'm looking for a fuzzy search JavaScript library to filter an array. I've tried using fuzzyset.js and fuse.js, but the results are terrible (there are demos you can try on the linked pages).

After doing some reading on Levenshtein distance, it strikes me as a poor approximation of what users are looking for when they type. For those who don't know, the metric counts how many insertions, deletions, and substitutions are needed to make two strings match.

One obvious flaw, which is fixed in the Damerau-Levenshtein model, is that both blub and boob are considered equally similar to bulb (each requiring two substitutions). It is clear, however, that bulb is more similar to blub than boob is, and the model I just mentioned recognizes that by allowing for transpositions.

I want to use this in the context of text completion, so if I have an array ['international', 'splint', 'tinder'], and my query is int, I think international ought to rank more highly than splint, even though the former has a score (higher=worse) of 10 versus the latter's 3.
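
For concreteness, here is a minimal textbook dynamic-programming sketch of plain Levenshtein distance (just an illustration, not a library), which reproduces the scores above:

function levenshtein(a, b) {
  const m = a.length, n = b.length;
  // prev[j] = distance between the first i-1 chars of a and first j chars of b
  let prev = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    const curr = [i];
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution (free on a match)
      );
    }
    prev = curr;
  }
  return prev[n];
}

levenshtein('blub', 'bulb');         // 2
levenshtein('boob', 'bulb');         // 2 -- tied, which is what transpositions fix
levenshtein('int', 'international'); // 10
levenshtein('int', 'splint');        // 3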

So what I'm looking for (and will create if it doesn't exist), is a library that does the following:

  • Weights the different text manipulations
  • Weights each manipulation differently depending on where it appears in a word (early manipulations being more costly than late ones; a sketch of this follows the list)
  • Returns a list of results sorted by relevance
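
Nothing I've found exposes exactly those knobs. To illustrate the second bullet, though, a hypothetical position-weighted Levenshtein variant might look like this (the geometric decay and its rate are arbitrary choices, purely for illustration):

// Hypothetical: like Levenshtein, but an edit at 0-based position i costs
// decay^i, so early edits are penalized more than late ones.
function weightedEditDistance(a, b, decay = 0.5) {
  const w = (i) => Math.pow(decay, i);
  const m = a.length, n = b.length;
  const d = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = 1; i <= m; i++) d[i][0] = d[i - 1][0] + w(i - 1);
  for (let j = 1; j <= n; j++) d[0][j] = d[0][j - 1] + w(j - 1);
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const subCost = a[i - 1] === b[j - 1] ? 0 : w(Math.min(i, j) - 1);
      d[i][j] = Math.min(
        d[i - 1][j] + w(i - 1),    // delete a[i-1]
        d[i][j - 1] + w(j - 1),    // insert b[j-1]
        d[i - 1][j - 1] + subCost  // substitute (free on a match)
      );
    }
  }
  return d[m][n];
}

With the default decay of 0.5, weightedEditDistance('int', 'international') comes out around 0.25 (ten cheap late insertions) while weightedEditDistance('int', 'splint') costs 1.75 (three expensive early insertions) -- exactly the ranking I want.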

Has anyone come across anything like this? I realize that StackOverflow isn't the place to be asking for software recommendations, but implicit (not anymore!) in the above is: am I thinking about this the right way?


Edit

I found a good paper (pdf) on the subject. Some notes and excerpts:

Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions

the Monge-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durbin et al. 1998) with particular cost parameters

For the Smith-Waterman distance (wikipedia), "Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure." It's similar in spirit to the n-gram approach of comparing local segments.

A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings.

A variant of this due to Winkler (1999) also uses the length P of the longest common prefix

(seem to be intended primarily for short strings)

For text completion purposes, the Monge-Elkan and Jaro-Winkler approaches seem to make the most sense. Winkler's addition to the Jaro metric effectively weights the beginnings of words more heavily. And the affine aspect of Monge-Elkan means that the necessity to complete a word (which is simply a sequence of additions) won't disfavor it too heavily.
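
To make the Jaro-Winkler idea concrete, here is a compact sketch of the standard formulation (common characters are counted within a window of half the longer length, transpositions among them are tallied, and Winkler adds a bonus for a shared prefix of up to four characters):

function jaro(s1, s2) {
  if (s1 === s2) return 1;
  const len1 = s1.length, len2 = s2.length;
  if (!len1 || !len2) return 0;
  const window = Math.max(Math.floor(Math.max(len1, len2) / 2) - 1, 0);
  const m1 = new Array(len1).fill(false);
  const m2 = new Array(len2).fill(false);
  let matches = 0;
  for (let i = 0; i < len1; i++) {
    const hi = Math.min(i + window + 1, len2);
    for (let j = Math.max(0, i - window); j < hi; j++) {
      if (!m2[j] && s1[i] === s2[j]) {
        m1[i] = m2[j] = true;
        matches++;
        break;
      }
    }
  }
  if (!matches) return 0;
  // count transpositions among the matched characters
  let k = 0, transpositions = 0;
  for (let i = 0; i < len1; i++) {
    if (!m1[i]) continue;
    while (!m2[k]) k++;
    if (s1[i] !== s2[k]) transpositions++;
    k++;
  }
  transpositions /= 2;
  return (matches / len1 + matches / len2 +
          (matches - transpositions) / matches) / 3;
}

// Winkler's prefix boost: weight the first few characters more heavily
function jaroWinkler(s1, s2, p = 0.1, maxPrefix = 4) {
  const j = jaro(s1, s2);
  let prefix = 0;
  while (prefix < Math.min(maxPrefix, s1.length, s2.length) &&
         s1[prefix] === s2[prefix]) prefix++;
  return j + prefix * p * (1 - j);
}

jaroWinkler('int', 'international'); // ≈ 0.82 -- the shared prefix boosts it
jaroWinkler('int', 'splint');        // 0 -- no common characters fall within the window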

Conclusion:

the TFIDF ranking performed best among several token-based distance metrics, and a tuned affine-gap edit-distance metric proposed by Monge and Elkan performed best among several string edit-distance metrics. A surprisingly good distance metric is a fast heuristic scheme, proposed by Jaro and later extended by Winkler. This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster. One simple way of combining the TFIDF method and the Jaro-Winkler is to replace the exact token matches used in TFIDF with approximate token matches based on the Jaro-Winkler scheme. This combination performs slightly better than either Jaro-Winkler or TFIDF on average, and occasionally performs much better. It is also close in performance to a learned combination of several of the best metrics considered in this paper.


Solution 2

Good question! But my thought is that, rather than trying to modify Damerau-Levenshtein, you might do better to try a different algorithm or to combine/weight the results from two algorithms.

It strikes me that exact or close matches to the "starting prefix" are something Damerau-Levenshtein gives no particular weight to -- but your apparent user expectations would.

I searched for "better than Levenshtein" and, among other things, found this:

http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/

This mentions a number of "string distance" measures. Three that looked particularly relevant to your requirement would be:

  1. Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.

  2. q-gram distance: Sum of absolute differences between N-gram vectors of both strings.

  3. Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams.

Maybe you could use a weighted combination (or minimum) of these metrics with Levenshtein -- common substring, common N-gram, or Jaccard will all strongly prefer similar strings -- or perhaps try just using Jaccard?
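
For illustration, quick sketches of the last two, using bigrams and translating the definitions above directly (untuned):

function qgrams(s, q = 2) {
  const grams = [];
  for (let i = 0; i + q <= s.length; i++) grams.push(s.slice(i, i + q));
  return grams;
}

// q-gram distance: sum of absolute differences between the gram counts
function qgramDistance(a, b, q = 2) {
  const counts = new Map();
  for (const g of qgrams(a, q)) counts.set(g, (counts.get(g) || 0) + 1);
  for (const g of qgrams(b, q)) counts.set(g, (counts.get(g) || 0) - 1);
  let d = 0;
  for (const c of counts.values()) d += Math.abs(c);
  return d;
}

// Jaccard distance: 1 minus the quotient of shared and all observed grams
function jaccardDistance(a, b, q = 2) {
  const A = new Set(qgrams(a, q)), B = new Set(qgrams(b, q));
  let shared = 0;
  for (const g of A) if (B.has(g)) shared++;
  const union = A.size + B.size - shared;
  return union ? 1 - shared / union : 0;
}

qgramDistance('int', 'international'); // 10
qgramDistance('int', 'splint');        // 3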

Depending on the size of your list/database, these algorithms can be moderately expensive. For a fuzzy search I implemented, I used a configurable number of N-grams as "retrieval keys" from the DB, then ran the expensive string-distance measure to sort them in preference order.
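
In outline, that two-stage approach looks like this (reusing the qgrams helper above; expensiveDistance stands in for whichever metric you rank with):

// Stage 1: index every entry under its trigrams (the "retrieval keys")
function buildIndex(entries, q = 3) {
  const index = new Map();
  for (const entry of entries) {
    for (const g of new Set(qgrams(entry.toLowerCase(), q))) {
      if (!index.has(g)) index.set(g, new Set());
      index.get(g).add(entry);
    }
  }
  return index;
}

// Stage 2: shortlist entries sharing at least one trigram with the query,
// then run the expensive metric only on the shortlist
function search(index, query, expensiveDistance, q = 3) {
  const candidates = new Set();
  for (const g of new Set(qgrams(query.toLowerCase(), q))) {
    for (const e of index.get(g) || []) candidates.add(e);
  }
  return [...candidates]
    .map((e) => ({ entry: e, dist: expensiveDistance(query, e) }))
    .sort((x, y) => x.dist - y.dist);
}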

I wrote some notes on Fuzzy String Search in SQL.

OTHER TIPS

I tried using existing fuzzy libraries like fuse.js and also found them to be terrible, so I wrote one that behaves basically like Sublime Text's search. https://github.com/farzher/fuzzysort

The only typo it allows is a transpose. It's pretty solid (1k stars, 0 issues), very fast, and handles your case easily:

fuzzysort.go('int', ['international', 'splint', 'tinder'])
// [{highlighted: '*int*ernational', score: 10}, {highlighted: 'spl*int*', score: 3003}]

Here is a technique I have used a few times...It gives pretty good results. Does not do everything you asked for though. Also, this can be expensive if the list is massive.

get_bigrams = (string) ->
    s = string.toLowerCase()
    v = new Array(s.length - 1)
    # exclusive range, so the final slice is a full bigram rather than a single character
    for i in [0...v.length] by 1
        v[i] = s.slice(i, i + 2)
    return v

string_similarity = (str1, str2) ->
    if str1.length > 0 and str2.length > 0
        pairs1 = get_bigrams(str1)
        pairs2 = get_bigrams(str2)
        union = pairs1.length + pairs2.length
        hit_count = 0
        for x in pairs1
            for y in pairs2
                if x is y
                    hit_count++
        if hit_count > 0
            return ((2.0 * hit_count) / union)
    return 0.0

Pass two strings to string_similarity and it will return a number between 0 and 1.0 depending on how similar they are. The usage example below uses Lo-Dash.

Usage example:

query = 'jenny Jackson'
names = ['John Jackson', 'Jack Johnson', 'Jerry Smith', 'Jenny Smith']

results = []
for name in names
    relevance = string_similarity(query, name)
    obj = {name: name, relevance: relevance}
    results.push(obj)

results = _.first(_.sortBy(results, 'relevance').reverse(), 10)

console.log results

Also... have a fiddle

Make sure your console is open or you won't see anything :)

This is my short and compact function for fuzzy matching:

function fuzzyMatch(pattern, str) {
  // escape each character so regex metacharacters in the pattern stay literal
  const parts = pattern.split('').map(c => c.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
  const re = new RegExp('.*' + parts.join('.*') + '.*');
  return re.test(str);
}
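
It simply tests that the pattern's characters appear in the string in order, with anything in between:

fuzzyMatch('int', 'splint'); // true  ('i', 'n', 't' appear in order)
fuzzyMatch('int', 'tinder'); // false (no 't' after the 'n')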

I fixed the problems with the CoffeeScript bigram solution by InternalFx and made it a generic n-gram solution (you can customize the size of the grams).

This is TypeScript but you can remove the type annotations and it works fine as vanilla JavaScript as well.

/**
 * Compares the similarity between two strings using an n-gram comparison method. 
 * The grams default to length 2.
 * @param str1 The first string to compare.
 * @param str2 The second string to compare.
 * @param gramSize The size of the grams. Defaults to length 2.
 */
function stringSimilarity(str1: string | null, str2: string | null, gramSize: number = 2) {
  function getNGrams(s: string, len: number) {
    s = ' '.repeat(len - 1) + s.toLowerCase() + ' '.repeat(len - 1);
    let v = new Array(s.length - len + 1);
    for (let i = 0; i < v.length; i++) {
      v[i] = s.slice(i, i + len);
    }
    return v;
  }

  if (!str1?.length || !str2?.length) { return 0.0; }

  //Order the strings by length so the order they're passed in doesn't matter 
  //and so the smaller string's ngrams are always the ones in the set
  let s1 = str1.length < str2.length ? str1 : str2;
  let s2 = str1.length < str2.length ? str2 : str1;

  let pairs1 = getNGrams(s1, gramSize);
  let pairs2 = getNGrams(s2, gramSize);
  let set = new Set<string>(pairs1);

  let total = pairs2.length;
  let hits = 0;
  for (let item of pairs2) {
    if (set.delete(item)) {
      hits++;
    }
  }
  return hits / total;
}

Examples:

console.log(stringSimilarity("Dog", "Dog"))
console.log(stringSimilarity("WolfmanJackIsDaBomb", "WolfmanJackIsDaBest"))
console.log(stringSimilarity("DateCreated", "CreatedDate"))
console.log(stringSimilarity("a", "b"))
console.log(stringSimilarity("CreateDt", "DateCreted"))
console.log(stringSimilarity("Phyllis", "PyllisX"))
console.log(stringSimilarity("Phyllis", "Pylhlis"))
console.log(stringSimilarity("cat", "cut"))
console.log(stringSimilarity("cat", "Cnut"))
console.log(stringSimilarity("cc", "Cccccccccccccccccccccccccccccccc"))
console.log(stringSimilarity("ab", "ababababababababababababababab"))
console.log(stringSimilarity("a whole long thing", "a"))
console.log(stringSimilarity("a", "a whole long thing"))
console.log(stringSimilarity("", "a non empty string"))
console.log(stringSimilarity(null, "a non empty string"))

Try it in the TypeScript Playground

You may take a look at Atom's https://github.com/atom/fuzzaldrin/ lib.

It is available on npm, has a simple API, and worked OK for me.

> fuzzaldrin.filter(['international', 'splint', 'tinder'], 'int');
< ["international", "splint"]

I've been in love with fuzzy matching for ages, and just ran across this thread. The conversation here is a lot further into the weeds than most, and looks to have involved implementers. I've coded several of these algorithms in different languages down the years, and want to pass along a few tips to anyone writing JS versions:

Monge-Elkan rules!

It's just fantastic, combining many of the strengths of n-grams with the best short-string comparison algorithms, such as Jaro-Winkler. (That's what I use in my Monge-Elkan code.) A couple of years back, I ran across a paper you can find on-line as a PDF named Generalized Mongue-Elkan Method for Approximate Text String Comparison. The take-away is that rather than using an arithmetic mean, use a quadratic mean. I tried it out, and it made a significant improvement in search results, across a wide variety of text.
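
In sketch form the generalization looks like this, assuming some inner token similarity such as the jaroWinkler sketched earlier (m = 1 recovers the classic arithmetic mean; m = 2 is the quadratic mean the paper recommends):

// Generalized Monge-Elkan: each token of s1 takes its best-matching token
// of s2, and the best-match scores are combined with a power mean of exponent m.
function mongeElkan(s1, s2, sim, m = 2) {
  const tokens1 = s1.split(/\s+/);
  const tokens2 = s2.split(/\s+/);
  let sum = 0;
  for (const a of tokens1) {
    let best = 0;
    for (const b of tokens2) best = Math.max(best, sim(a, b));
    sum += Math.pow(best, m);
  }
  return Math.pow(sum / tokens1.length, 1 / m);
}

mongeElkan('jenny jackson', 'jenny smith', jaroWinkler); // tokenwise best matches, quadratic mean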

N-Grams Rule!

Very robust, high-quality performance across a range of source languages and text types. If you're looking at databases, it is possible to implement this as a high-quality, lightning-fast, indexed K-NN search in Postgres. It takes lining up a few different features properly, but it's not too bad.

In any case, when splitting n-grams, there are different approaches to handling front-end padding. Like, if you've got a traditional n (q or k) of 3, then do you split 'ander' like this

'  a'
' an'
'and'
'nde'
'der'
'er '
'r  '

or

'  a'
' an'
'and'
'nde'
'der'

or

'and'
'nde'
'der'

Instinctively, I've always expected the first list to work best but, in practice, it can be the second or third. It's worth experimenting with the padding and windowing rules to see how they perform in your context. Few libraries provide control over this behavior, which would be a nice feature to support. Hint.
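
A splitter that exposes the choice might look like this (front = back = n - 1 gives the first list above; front only, the second; neither, the third):

// n-gram splitter with configurable front/back padding
function ngrams(s, n = 3, front = n - 1, back = 0) {
  const padded = ' '.repeat(front) + s + ' '.repeat(back);
  const grams = [];
  for (let i = 0; i + n <= padded.length; i++) grams.push(padded.slice(i, i + n));
  return grams;
}

ngrams('ander', 3, 2, 2); // ['  a', ' an', 'and', 'nde', 'der', 'er ', 'r  ']
ngrams('ander', 3, 2, 0); // ['  a', ' an', 'and', 'nde', 'der']
ngrams('ander', 3, 0, 0); // ['and', 'nde', 'der']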

Here is the solution provided by @InternalFX, but in JS (I used it, so I'm sharing it):

function get_bigrams(string){
  var s = string.toLowerCase();
  var v = [];
  // stop one short of the end so every slice is a full two-character bigram
  for(var i = 0; i < s.length - 1; i++){ v.push(s.slice(i, i + 2)); }
  return v;
}

function string_similarity(str1, str2){
  if(str1.length > 0 && str2.length > 0){
    var pairs1 = get_bigrams(str1);
    var pairs2 = get_bigrams(str2);
    var union = pairs1.length + pairs2.length;
    var hits = 0;
    for(var x = 0; x < pairs1.length; x++){
      for(var y = 0; y < pairs2.length; y++){
        if(pairs1[x] == pairs2[y]) hits++;
      }
    }
    if(hits > 0) return (2.0 * hits) / union;
  }
  return 0.0;
}

November 2019 update: I found fuse.js to have some pretty decent upgrades. However, I could not get it to use boolean operators (i.e. OR, AND, etc.), nor could I use the API search interface to filter results.

I discovered nextapps-de/flexsearch: https://github.com/nextapps-de/flexsearch and I believe it far surpasses a lot of the other JavaScript search libraries that I've tried; it supports boolean operators, search filtering, and pagination.

You can input a list of javascript objects for your search data (i.e. storage), and the API is fairly well documented: https://github.com/nextapps-de/flexsearch#api-overview

So far I've indexed close to 10,000 records, and my searches are next to immediate; i.e., an unnoticeable amount of time for each search.
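
As a minimal sketch of basic usage, based on the Index API in the project's docs (check the repo for the current options and signatures):

import { Index } from "flexsearch";

// "forward" tokenization so partial inputs like 'int' match longer words
const index = new Index({ tokenize: "forward" });
["international", "splint", "tinder"].forEach((word, id) => index.add(id, word));

index.search("int"); // returns an array of matching ids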

A plain jQuery approach: once three characters have been typed, complete from the array using the first entry that contains the input and shares its first character:

(function (list) {
    $("input[id=input]").on("input", { sort: list }, function (e) {
        $.each(e.data.sort, function (index, value) {
            if (value.indexOf($(e.target).val()) !== -1
                && value.charAt(0) === $(e.target).val().charAt(0)
                && $(e.target).val().length === 3) {
                $("output[for=input]").val(value);
                return false; // stop at the first match
            }
        });
    });
}(["international", "splint", "tinder"]));

jsfiddle http://jsfiddle.net/guest271314/QP7z5/

Fuzzy Sort is a JavaScript library that is helpful for performing string matching over a large collection of data.

The following code shows how to use fuzzysort in React.js.

  1. Install fuzzysort through npm:

    npm install fuzzysort
    
  2. Make a reference variable:

    const fuzzysort = require('fuzzysort')
    
  3. Use the go() method to find matching strings:

    search(keyword, category) {  
      return fuzzysort.go(keyword, data[category]);
    }
    

Full demo code in React.js:

import React from 'react';
import './App.css';
import data from './testdata';
const fuzzysort = require('fuzzysort');

class App extends React.Component {
  constructor(props){
    super(props)
    this.state = {
      keyword: '',
      results: [],
    }
    console.log("data: ", data["steam_games"]);
  }

  search(keyword, category) {  
    return fuzzysort.go(keyword, data[category]);
  }

  render(){
    return (
      <div className="App">
        <input type="text" onChange={(e)=> this.setState({keyword: e.target.value})}
          value={this.state.keyword}
        />
        <button onClick={()=>this.setState({results: this.search(this.state.keyword, "steam_games")})}>Search</button>
        {this.state.results !== null && this.state.results.length > 0 ?
          <h3>Results:</h3> : null
        }
        <ul>
        {this.state.results.map((item, index) =>{
            return(
              <li key={index}>{item.score} : {item.target}</li>
            )
          })
        }
        </ul>
      </div>
    );
  }
}

export default App;

For more, refer to FuzzySort.

This could be achieved by using a regex.

Example:

  const fuzzySearch = (list, searchValue) => {
    // escape each character so regex metacharacters in the query stay literal
    let parts = searchValue
      .toLowerCase()
      .split("")
      .map((c) => c.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
    let reg = new RegExp(".*" + parts.join(".*"));
    let newList = list.filter(function (e) {
      return reg.test(e.title.toLowerCase());
    });
    return newList;
  };

Working example: https://codesandbox.io/s/jovial-fermat-cilh1?file=/src/App.js:28894-29167

Licensed under: CC-BY-SA with attribution