Efficient way to check if a given string is equivalent to at least one string in the given set of strings

StackOverflow https://stackoverflow.com/questions/20791550

Question

Given a set of strings, say "String1", "String2",..., "StringN", what is the most efficient way in C++ to determine (return true or false) whether given string s matches any of the strings in the above set?

Can Boost.Regex be used for this task?

Was it helpful?

Solution

std::unordered_set would provide the most efficient look-up (amortized constant time).

#include <unordered_set>
#include <string>
#include <cassert>

int main() {
    std::unordered_set<std::string> s = {"Hello", "Goodbye", "Good morning"};
    assert(s.find("Goodbye") != s.end());
    assert(s.find("Good afternoon") == s.end());
    return 0;
}

OTHER TIPS

You can put all your strings in a std::set and then check if that string is in the set.

In some situations in which you frequently need to perform these checks, a great performance improvement can be obtained from Interning.

Interning still requires us to have some string lookup data structure, such as a tree or hash table. However, we do these heavy lookups more rarely: specificaly, we do them only whenever some raw textual input arrives from the environment into our software system.

At that time, we take the input text as a character string and intern it: look it up in the existing set of strings, and convert it to an atom. An atom is small unit of data, typically a single-word quantity such as a machine pointer. If the string doesn't exist, the interning function gives us a new, unique atom. Otherwise, it gives us the same atom that it gave us previously for that string.

Once we have interned input strings to atoms, we always use the atoms in their place. So instead of comparing whether two strings are the same, we compare whether two atoms are the same, which is a "blindingly fast" single-word comparison operation. (We still use the the strings when we need to print atoms in a readable way: again at the boundary between our system and the outside world).

Interning comes from Lisp: in Lisp symbols are atoms. In the raw source code, they are textual, and so when code is read into memory, symbol names are interned to produce atoms which are basically pointers to symbol objects.

Interning crops up in other places such as the X Window system (XInternAtom function):

Atom XInternAtom(Display *display, char *atom_name, Bool only_if_exists)

and in the Microsoft Windows API where the term "intern" is not used, but the function returns something called an ATOM: Lisp terminology. What is interned is not a simple string but a "class" strucure:

ATOM WINAPI RegisterClass(const WNDCLASS *lpWndClass);

In both systems, these atoms are system-wide (server-wide in the case of X) and can be compared for equality in place of the objects they represent. In Windows if you have two ATOM values which are equal, they are the same class.

The Flyweight Design Pattern from the GoF book is essentially a reinvention of interning, extended to structures other than strings (like WNDCLASS in the above Win32 API); so if you want to "sell" the idea to your boss, you can take it from that angle.

An alternative is to construct a N-ary tree to store all strings.

Node look like:

struct Node {
    Node* children[256]; // Or reduce to correct accepted char
    bool isAWord; // true when the letters from the root forms a word
}

So, with "String1", "String10", "String2", "StringN", Tree is:

     Root
      |
     [S]
      |
     [t]
      |
     [r]
      |
     [i]
      |
     [n]
      |
     [g]
    / |  \
[1*] [2*] [N*]
 |
[0*]

Once you have your tree, look into it to see if the string matches.
Search Complexity: size of the string to search.

As an alternative you can define an ordered array of character arrays or strings and use standard algorithms std::binary_search, std::lower_bound, std::upper_bound or std::equal_range to check whether the target string is present in the array.

I will use the example already shown here.

#include <algorithm>
#include <iterator>
#include <string>
#include <iostream>
#include <iomanip>

int main() 
{
    const char * s[] = { "Good morning", "Goodbye", "Hello" };
    std::cout << std::boolalpha 
              << std::binary_search( std::begin( s ), std::end( s ), std::string( "Goodbye" ) ) 
              << std::endl;
    std::cout << std::boolalpha 
              << std::binary_search( std::begin( s ), std::end( s ), std::string( "Good afternoon" ) ) 
              << std::endl;
    return 0;
}

Assuming the strings are not entirely random and my have prefixes in common, it may be more efficient to use a Trie: initially building the data structure may be more expensive than creating other containers but if there are many queries made against the set of strings this could pay off. The main disadvantage is that there is no trie implementation in the standard C++ library.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top