Question

My department handles the collection and display of data from a wide range of intra-company sources for use in data-mining/company dashboards.

One large challenge we have is cross-referencing location names across various departments. We are a rather large organization, and departments with different interests each do their own reporting for any one location. In general, there is a lot of discrepancy in the EXACT name a location is given in reporting across those departments. For instance, a location may be referred to as:

  • The Fabulous Restaurant
  • fabulous restaurant
  • Fabulous F&B
  • When the location goes through some renovation... Fabulous Cafe'
  • or even Profit Center 12345ABC

So my question is: what best practices exist for reconciling these names in our own database and code? Let's assume for the moment that my department does not have the ability to unite the organization under a common hierarchy standard (which would be the optimal solution). At the moment, our practice is to maintain ever-growing reference tables of location names, which are then mapped back to our own naming standard. This allows us to maintain historical consistency in our data.

Is it feasible/advisable to implement some kind of "fuzzy search" when cross-referencing locations? Something, for instance, that might ignore words like "the", or treat "cafe'" and "restaurant" as equivalent (based on some predefined logic).

I certainly don't think we would ever be able to algorithmically account for ALL of the random naming conventions we encounter, but is it enough to be able to account for some/most of them?

Solution

Fuzzy search is common for this kind of problem, and definitely useful here. But the examples you gave might be a bit too hard for fully automatic integration; you'll need user intervention as well.
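One way to combine fuzzy matching with user intervention is to accept only high-scoring matches automatically and route the uncertain ones to a person. The following is a minimal sketch using Python's standard `difflib`; the canonical names, the two thresholds, and the function names are all assumptions made up for illustration, not a recommended configuration.

```python
from difflib import SequenceMatcher

# Hypothetical canonical names and thresholds (assumptions for illustration only).
CANONICAL = ["Fabulous Restaurant", "Harbor Grill", "Lakeside Bistro"]
AUTO_ACCEPT = 0.90   # at or above this score, map the name automatically
NEEDS_REVIEW = 0.50  # between the two thresholds, ask a person to confirm

def best_match(raw):
    """Return (canonical_name, score) for the closest canonical name."""
    scored = [
        (SequenceMatcher(None, raw.lower(), c.lower()).ratio(), c)
        for c in CANONICAL
    ]
    score, name = max(scored)
    return name, score

def classify(raw):
    """Decide whether a raw name is accepted, queued for review, or unmatched."""
    name, score = best_match(raw)
    if score >= AUTO_ACCEPT:
        return ("accept", name)
    if score >= NEEDS_REVIEW:
        return ("review", name)
    return ("unmatched", None)

if __name__ == "__main__":
    for raw in ["fabulous restaurant", "Fabulous Cafe'", "Profit Center 12345ABC"]:
        print(raw, "->", classify(raw))
```

Names that a reviewer confirms can then be added to the existing reference table, so the fuzzy step only has to deal with names that have not been seen before. A name like "Profit Center 12345ABC" shares nothing with the canonical name and will always need a manual entry.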

I've successfully used fuzzy matching to re-import music playlists, even ones from the internet. Title and artist usually provide enough data to do reasonably reliable fuzzy matching against my music collection (at least if I have the song).
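The playlist case works because two fields are compared at once. A rough sketch of that idea, again with `difflib`, is shown here; the field weights, the tiny library, and the helper names are hypothetical and would need tuning on real data.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity in [0, 1], ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def record_score(query, entry, weights):
    """Weighted sum of per-field similarities; weights are assumed to sum to 1."""
    return sum(w * similarity(query[f], entry[f]) for f, w in weights.items())

# Hypothetical weights and library contents, for illustration only.
WEIGHTS = {"title": 0.6, "artist": 0.4}
LIBRARY = [
    {"title": "Hallelujah", "artist": "Leonard Cohen"},
    {"title": "Hallelujah", "artist": "Jeff Buckley"},
]

def best_entry(query):
    """Return the library entry with the highest combined score."""
    return max(LIBRARY, key=lambda e: record_score(query, e, WEIGHTS))

if __name__ == "__main__":
    # The title alone is ambiguous; the misspelled artist still disambiguates.
    print(best_entry({"title": "hallelujah ", "artist": "Jeff Buckly"}))
```

For locations, the second field might be an address, a cost center, or a region code, if any department happens to report one alongside the name.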

However, fuzzy matching will not be reliable if you essentially have just a single word to go on, as in your "fabulous restaurant" example.

A good fuzzy matcher will use stemming and have a notion of common words and synonyms. So "restaurant" and "cafe" will probably not be considered significant. The key point then is to have enough data. A single word will probably not be enough to identify a location.
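To make the "common words" point concrete, here is a small sketch that drops stop words and generic venue words before comparing names by token overlap (Jaccard similarity). The word lists and function names are assumptions, and stemming is left out to keep the example short.

```python
import re

# Hypothetical word lists; these are assumptions and should be tuned to real data.
STOP_WORDS = {"the", "a", "an", "of", "and"}
GENERIC_WORDS = {"restaurant", "cafe", "f&b", "bar", "grill"}

def significant_tokens(name):
    """Lowercase, strip punctuation, and drop stop words and generic venue words."""
    tokens = re.findall(r"[a-z0-9&]+", name.lower())
    return frozenset(
        t for t in tokens if t not in STOP_WORDS and t not in GENERIC_WORDS
    )

def token_similarity(a, b):
    """Jaccard similarity of the significant tokens of two names."""
    ta, tb = significant_tokens(a), significant_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

if __name__ == "__main__":
    print(token_similarity("The Fabulous Restaurant", "Fabulous F&B"))      # 1.0
    print(token_similarity("Fabulous Cafe'", "Profit Center 12345ABC"))     # 0.0
    # Only "fabulous" survives filtering, so an unrelated venue also scores 1.0:
    print(token_similarity("The Fabulous Restaurant", "Fabulous Bar"))      # 1.0
```

The last line shows the weakness described above: once the insignificant words are removed, a single remaining word cannot tell two different locations apart.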

Licensed under: CC-BY-SA with attribution