Question

I used to have some C++ code that stores strings as rows of a character matrix (one string per row). The classes CharacterMatrix and LogicalVector are provided by Rcpp.h:

LogicalVector unq_mat( CharacterMatrix x ){
  int nc = x.ncol() ; // Get the number of columns in the matrix. 
  LogicalVector out(nc); // Make a logical (bool) vector of the same length.
  // For every col in the matrix, assess whether the column contains more than one unique character.
  for( int i=0; i < nc; i++ ) {
    out[i] = unique( x(_,i) ).size() != 1 ;
    }
  return out;
}

The logical vector identifies which columns contain more than one unique character. It is then passed back to R and used to manipulate a matrix. This is a very R way of thinking about the problem.

However, I'm interested in developing my thinking in C++, so I'd like to write something that achieves the same thing: find out which character positions in n strings are not all the same, preferably using STL classes such as std::string. As a conceptual example, take three strings: A = "Hello", B = "Heleo", C = "Hidey". The code would report that positions 2, 3, 4 and 5 hold more than one value, while position 1 (the 'H') is the same in all strings (i.e. there is only one unique value). I have something below that I thought worked:

std::vector<int> StringsCompare(std::vector<string>& stringVector) {
    std::vector<int> informative;
    for (int i = 0; i < stringVector[0].size()-1; i++) {
        for (int n = 1; n < stringVector.size()-1; n++) {
            if (stringVector[n][i] != stringVector[n-1][i]) {
                informative.push_back(i);
                break;
            }
        }
    }
    return informative;
}

It's supposed to go through every character position (0 to size of string - 1) with the outer loop and, with the inner loop, check whether the character in string n differs from the character in string n-1. Where the character is the same in every string, for example the H in my hello example above, the condition is never true. Where the characters differ, the inner loop's if statement is satisfied, the character position is recorded, and the inner loop is broken out of. I then get back a vector containing the indices of the positions where the characters in the n strings are not all identical. However, these two functions give me different answers. How else can I go through n strings char by char and check whether they are all identical?

Thanks, Ben.

Solution

I expected @doctorlove to provide an answer. I'll enter one here in case he does not.

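First, to visit every element of a string or vector by index, i must run from 0 to size()-1 inclusive. The loop for (int i=0; i<str.size(); i++) already does that: the condition i < str.size() ends the loop once i has reached size()-1. With the extra -1 in your conditions, the last character of each string and the last string in the vector are never examined. So remove the -1's.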

Second, C++ indices are 0-based, whereas R and your example count positions from 1. To report the same positions as your example, add 1 to the value that is pushed into the vector.

std::vector<int> StringsCompare(std::vector<std::string>& stringVector) {
    std::vector<int> informative;
    for (int i = 0; i < stringVector[0].size(); i++) {
        for (int n = 1; n < stringVector.size(); n++) {
            if (stringVector[n][i] != stringVector[n-1][i]) {
                informative.push_back(i+1);
                break;
            }
        }
    }
    return informative;
}

A few things to note about this code:

  1. The function should take a const reference to vector, as the input vector is not modified. Not really a problem here, but for various reasons, it's a good idea to declare unmodified input references as const.
  2. This assumes that all the strings are at least as long as the first. If that doesn't hold, the behavior of the code is undefined. For "production" code, you should include a check for the length prior to extracting the ith element of each string.
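
The two notes above could be folded into the function along the following lines. This is only a sketch under stated assumptions: the name differingPositions is made up for illustration, and comparing only up to the length of the shortest string is one possible policy for unequal lengths (rejecting such input would be another). It also returns an empty result for an empty input vector rather than indexing stringVector[0].

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (name not from the original post).
// Returns the 1-based positions at which the strings do not all hold the same character.
// Assumption: only positions that exist in every string are compared,
// i.e. the scan stops at the length of the shortest string.
std::vector<std::size_t> differingPositions(const std::vector<std::string>& strings) {
    std::vector<std::size_t> informative;
    if (strings.empty()) {
        return informative;  // nothing to compare
    }

    // Length check: find the shortest string so that s[i] is always in range.
    std::size_t common = strings[0].size();
    for (const std::string& s : strings) {
        common = std::min(common, s.size());
    }

    for (std::size_t i = 0; i < common; ++i) {
        const char first = strings[0][i];
        // A position is uninformative only if every string matches the first one there.
        const bool allSame = std::all_of(
            strings.begin(), strings.end(),
            [first, i](const std::string& s) { return s[i] == first; });
        if (!allSame) {
            informative.push_back(i + 1);  // report 1-based positions, as in the question
        }
    }
    return informative;
}
```

Called with the three strings from the question ("Hello", "Heleo", "Hidey"), this returns the positions 2, 3, 4 and 5.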
Licensed under: CC-BY-SA with attribution