Question

I have an input stream containing integers and a special character '#'. It looks as follows: ... 12 18 16 # 22 24 26 15 # 17 # 32 35 33 ... The tokens are separated by spaces. There's no pattern to the positions of '#'.

I was trying to tokenize the input stream like this:

int value;
std::ifstream input("data");
if (input.good()) {
  string line;
  while(getline(data, line) != EOF) {
    if (!line.empty()) {
      sstream ss(line);
      while (ss >> value) {
        //process value ...

      }
    }
  }
}

The problem with this code is that the processing stops when the first '#' is encountered.

The only solution I can think of is to extract each token into a string, skip it if it is '#', and otherwise convert it to an integer with atoi(). However, that seems very inefficient, as the majority of tokens are integers, and calling atoi() on every token introduces a lot of overhead.

Is there a way I can parse each token according to its type? That is, parse integers as integers and skip '#'. Thanks!


Solution

One possibility would be to explicitly skip whitespace (ss >> std::ws), and then to use ss.peek() to find out if a # follows. If yes, use ss.get() to read it and continue, otherwise use ss >> value to read the value.
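
A minimal sketch of this peek-based approach follows. The helper name readInts and the choice to collect the values in a std::vector are assumptions made for illustration, not part of the original answer; the sketch also assumes that a token which is neither an integer nor '#' should simply end the parse.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects every integer in the stream and skips '#'.
std::vector<int> readInts(std::istream& in) {
    std::vector<int> values;
    int value;
    // std::ws discards leading whitespace so that peek() sees the next token.
    while (in >> std::ws) {
        const int next = in.peek();
        if (next == std::istream::traits_type::eof()) {
            break;                 // nothing left to read
        }
        if (next == '#') {
            in.get();              // consume the marker and move on
            continue;
        }
        if (!(in >> value)) {
            break;                 // assumption: stop at a malformed token
        }
        values.push_back(value);
    }
    return values;
}
```

The same function works on a whole std::ifstream or on a std::istringstream built from one line, so the outer getline loop becomes optional.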

If the positions of # don't matter, you could also remove all '#' from the line before initializing the stringstream with it.
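
Stripping the markers first could look roughly like this. The function name parseLineIgnoringHashes is hypothetical, and the sketch assumes tokens are always separated by spaces, so deleting a '#' can never glue two numbers together (replacing '#' with a space via std::replace would be the more defensive variant).

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: drops every '#' from the line, then parses the rest.
std::vector<int> parseLineIgnoringHashes(std::string line) {
    // Erase-remove idiom: shift the kept characters forward, then trim the tail.
    line.erase(std::remove(line.begin(), line.end(), '#'), line.end());

    std::istringstream ss(line);
    std::vector<int> values;
    int value;
    while (ss >> value) {          // now every remaining token is an integer
        values.push_back(value);
    }
    return values;
}
```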

OTHER TIPS

Usually not worth testing against good()

if (input.good()) {

Unless your next operation is generating an error message or exception. If it is not good all further operations will fail anyway.

Don't test against EOF.

while(getline(data, line) != EOF) {

The result of std::getline() is not an integer. It is a reference to the input stream. The stream is convertible to a bool-like object that can be used in a boolean context (such as the condition of a while or an if), and it tests true as long as the read succeeded. So what you want is:

while(getline(data, line)) {
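
As a small self-contained illustration of that loop condition, the hypothetical helper readLines below (the name is not from the original post) keeps reading until std::getline() fails, without ever comparing against EOF.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns every line of the stream.
std::vector<std::string> readLines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    // getline returns the stream; the stream tests false once a read fails.
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

The loop ends both at end of input and on a read error, which is why no separate good() check is needed before it.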

I am not sure I would read a line at all. Since the input is space separated, you could just read one word at a time, using the >> operator on a string:

std::string word;
while(data >> word) {  // reads one space separated word

Now you can test the word to see if it is your special character:

if (word[0] == '#')

If not convert the word into a number.
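
Put together, the word-by-word idea might be sketched as follows. The struct Parsed and the function parseWords are hypothetical names, and the sketch swaps in std::strtol (rather than atoi) so that malformed words can be detected; it assumes such words should be skipped and that every number fits in an int.

```cpp
#include <cassert>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type: the integers seen and the number of '#' markers.
struct Parsed {
    std::vector<int> values;
    int markers = 0;
};

Parsed parseWords(std::istream& in) {
    Parsed result;
    std::string word;
    while (in >> word) {                       // one whitespace-separated word
        if (word == "#") {
            ++result.markers;
            continue;
        }
        char* end = nullptr;
        const long n = std::strtol(word.c_str(), &end, 10);
        if (end == word.c_str() || *end != '\0') {
            continue;                          // assumption: skip malformed words
        }
        result.values.push_back(static_cast<int>(n));  // assumes n fits in int
    }
    return result;
}
```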

This is what I would do:

// Requires <cstdlib>, <istream> and <string>.
// A class that reads either an integer or the special '#' token from a stream.
class MyValue
{
  public:
    bool isSpec() const {return isSpecial;}
    int  value()  const {return intValue;}

    friend std::istream& operator>>(std::istream& stream, MyValue& data)
    {
        std::string item;
        if (stream >> item) {   // leave data untouched if the read failed
            if (item[0] == '#') {
                data.isSpecial = true;
            } else {
                data.isSpecial = false;
                data.intValue  = std::atoi(item.c_str());
            }
        }
        return stream;
    }
  private:
    bool isSpecial = false;
    int  intValue  = 0;
};

// Now your loop becomes:
MyValue  val;
while(file >> val)
{
    if (val.isSpec())  { /* Special processing */ }
    else               { /* We have an integer */ }
}

Maybe you can read all values as std::string and then check if it's "#" or not (and if not - convert to int)

// Requires <fstream>, <sstream> and <string>.
int value;
std::ifstream input("data");
std::string chunk;
// Outer loop: split the whole input at each '#'.
while (std::getline(input, chunk, '#')) {
    // A fresh stream per chunk avoids having to reset the state flags.
    std::istringstream ss(chunk);
    // Inner loop: operator>> splits the chunk on whitespace.
    while (ss >> value) {
        // process value ...
    }
}

Here the outer while loop first splits the input at each '#'; the inner while loop then splits each piece on whitespace and converts each token to an int.

Personally, if your separator is always going to be space regardless of what follows, I'd recommend you just take the input as string and parse from there. That way, you can take the string, see if it's a number or a # and whatnot.

I think you should re-examine your premise that "Calling atoi() on the tokens introduces big overhead."

There is no magic to std::cin >> val. Under the hood, it ends up calling (something very similar to) atoi.

If your tokens are huge, there might be some overhead to creating a std::string but as you say, the vast majority are numbers (and the rest are #'s) so they should mostly be short.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow