Question

I'm trying to build a local version of the Freebase search API using their quad dumps. I'm wondering what algorithm they use to match names. As an example, if you go to freebase.com and type in "Hiking", you get:

  • "Apo Hiking Society"
  • "Hiking"
  • "Hiking Georgia"
  • "Hiking Virginia's national forests"
  • "Hiking trail"

Solution

Wow, a lot of guesses! I hope I don't muddy the waters too much by not guessing, too.

The auto-complete box is basically powered by Freebase Suggest, which is powered, in turn, by the Freebase Search service. The strings indexed by the search service for matching include:

  • the name,
  • all aliases in the given language,
  • link anchor text from the associated Wikipedia articles, and
  • identifiers (called keys by Freebase), which include things like Wikipedia article titles (and redirects).
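For reference, here's a minimal sketch of how you might index those four string sources with Lucene. The field names and example values are made up for illustration; this is not Freebase's actual schema.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class TopicIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            Document doc = new Document();
            // 1) the display name
            doc.add(new TextField("name", "Hiking", Field.Store.YES));
            // 2) aliases in the given language
            doc.add(new TextField("alias", "Hill walking", Field.Store.YES));
            // 3) anchor text from linking Wikipedia articles
            doc.add(new TextField("anchor", "hike trail walking", Field.Store.NO));
            // 4) identifiers/keys, e.g. the Wikipedia article title (not tokenized)
            doc.add(new StringField("key", "/wikipedia/en/Hiking", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}
```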

How the various things are weighted/boosted hasn't been disclosed, but you can get a feel for it by playing with the service for a while. As you can see from the API, there's also the ability to filter/weight by types and other criteria, and this can come into play depending on the context. For example, if you're adding a record label to an album, topics typed as record labels will get a boost relative to things that aren't (but you can still get to things of other types, to allow for the use case where your target topic hasn't had the appropriate type applied yet).
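As a sketch of that kind of contextual boosting, here's roughly what such a query could look like in Lucene. The field names and the boost factor are assumptions, since the actual weights haven't been disclosed.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class TypeBoostedQuery {
    // Match the user's prefix and boost a preferred type without
    // excluding other types, so untyped topics can still be found.
    static Query build(String prefix, String preferredType) {
        Query nameMatch = new PrefixQuery(new Term("name", prefix.toLowerCase()));
        Query typeMatch = new TermQuery(new Term("type", preferredType));
        return new BooleanQuery.Builder()
                .add(nameMatch, BooleanClause.Occur.MUST)
                .add(new BoostQuery(typeMatch, 4.0f), BooleanClause.Occur.SHOULD)
                .build();
    }
}
```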

So that gives you a little insight into how their service works, but why not build a search service that does what you need since you're starting from scratch anyway?

BTW, the pre-Google Metaweb search implementation was built on top of Lucene, so you could definitely do worse than using that as your starting point. You can read some of the details in the mailing list archive.

OTHER TIPS

They probably use an inverted index over selected fields, such as the English name, the aliases, and the Wikipedia snippet displayed. In your application you can achieve that using something like Lucene.
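To make the idea concrete, here's a toy inverted index in Java; Lucene does the same thing with far more sophistication (compressed postings, scoring, analyzers, and so on).

```java
import java.util.*;

// A toy inverted index: each term maps to the sorted set of
// document ids whose text contains it.
public class InvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    public Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "Apo Hiking Society");
        idx.add(2, "Hiking Georgia");
        System.out.println(idx.lookup("hiking")); // [1, 2]
    }
}
```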

On the algorithm side, I find the following paper to be a good overview:

Zobel and Moffat (2006): "Inverted Files for Text Search Engines".

Most likely it's a trie with lexicographical order.
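For instance, a prefix lookup over a trie whose children are kept in sorted order returns completions in lexicographical order. Here's a minimal sketch:

```java
import java.util.*;

// A minimal trie that returns all stored names sharing a given prefix,
// in lexicographic order (TreeMap keeps the children sorted).
public class NameTrie {
    private final TreeMap<Character, NameTrie> children = new TreeMap<>();
    private boolean terminal;

    public void insert(String name) {
        NameTrie node = this;
        for (char c : name.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new NameTrie());
        }
        node.terminal = true;
    }

    public List<String> complete(String prefix) {
        NameTrie node = this;
        for (char c : prefix.toLowerCase().toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList();
        }
        List<String> results = new ArrayList<>();
        node.collect(prefix.toLowerCase(), results);
        return results;
    }

    private void collect(String path, List<String> out) {
        if (terminal) out.add(path);
        for (Map.Entry<Character, NameTrie> e : children.entrySet()) {
            e.getValue().collect(path + e.getKey(), out);
        }
    }

    public static void main(String[] args) {
        NameTrie trie = new NameTrie();
        trie.insert("Hiking");
        trie.insert("Hiking trail");
        trie.insert("Hiking Georgia");
        System.out.println(trie.complete("hik"));
        // [hiking, hiking georgia, hiking trail]
    }
}
```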

There are a number of string-matching algorithms available: Boyer-Moore, Smith-Waterman-Gotoh, Knuth-Morris-Pratt, etc. You might also want to look into edit-distance algorithms such as Levenshtein. You will need to play around to see which best suits your purpose.
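As an example, here's the classic dynamic-programming Levenshtein distance, the usual starting point among the edit-distance algorithms mentioned above:

```java
// Levenshtein distance: the minimum number of single-character
// insertions, deletions, and substitutions to turn a into b.
// Uses two rolling rows of the DP table, so memory is O(|b|).
public class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("hiking", "hikking")); // 1
    }
}
```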

An implementation of many of these algorithms is the SimMetrics library from the University of Sheffield.
