Question

I am playing around with the boost strings library and have just come across the awesome simplicity of the split method.

  string delimiters = ",";
  string str = "string, with, comma, delimited, tokens, \"and delimiters, inside a quote\"";
  // If we didn't care about delimiter characters within a quoted section we could use
  vector<string> tokens;  
  boost::split(tokens, str, boost::is_any_of(delimiters));
  // gives the wrong result: tokens = {"string", " with", " comma", " delimited", " tokens", "\"and delimiters", " inside a quote\""}

That would be nice and concise. However, it does not treat quoted sections specially, so instead I have to do something like the following:

string delimiters = ",";
string str = "string, with, comma, delimited, tokens, \"and delimiters, inside a quote\"";
vector<string> tokens; 
escaped_list_separator<char> separator("\\",delimiters, "\"");
typedef tokenizer<escaped_list_separator<char> > Tokeniser;
Tokeniser t(str, separator);
for (Tokeniser::iterator it = t.begin(); it != t.end(); ++it)
    tokens.push_back(*it);
// keeps the quoted section together (the separator strips the quote characters): tokens = {"string", " with", " comma", " delimited", " tokens", " and delimiters, inside a quote"}

My question is: can split, or another standard algorithm, be used when delimiters appear inside quotes? Thanks to purpledog, but I already have a non-deprecated way of achieving the desired outcome. I just think it is quite cumbersome, and unless I can replace it with a simpler, more elegant solution I wouldn't use it in general without first wrapping it in yet another method.

EDIT: Updated code to show results and clarify question.

Solution

There doesn't seem to be any simple way to do this with boost::split. The shortest piece of code I can find to do this is:

vector<string> tokens; 
tokenizer<escaped_list_separator<char> > t(str, escaped_list_separator<char>("\\", ",", "\""));
BOOST_FOREACH(string s, t)
    tokens.push_back(s);  

which is only marginally more verbose than the original snippet

vector<string> tokens;  
boost::split(tokens, str, boost::is_any_of(","));

OTHER TIPS

This will achieve the same result as Jamie Cook's answer without the explicit loop.

tokenizer<escaped_list_separator<char> >tok(str);
vector<string> tokens( tok.begin(), tok.end() );

The tokenizer constructor's second parameter defaults to a default-constructed escaped_list_separator<char>, which is equivalent to escaped_list_separator<char>("\\", ",", "\""), so it is not necessary unless you need different escape, separator, or quote characters.

I don't know about the boost::string library, but with the boost regex_token_iterator you can express delimiters as a regular expression. So yes, you can use quoted delimiters, and far more complex things as well.

Note that this used to be done with regex_split which is now deprecated.

Here's an example taken from the boost doc:

#include <iostream>
#include <string>
#include <boost/regex.hpp>

using namespace std;

int main(int argc, char*[])
{
   string s;
   do{
      if(argc == 1)
      {
         cout << "Enter text to split (or \"quit\" to exit): ";
         getline(cin, s);
         if(s == "quit") break;
      }
      else
         s = "This is a string of tokens";

      boost::regex re("\\s+");
      boost::sregex_token_iterator i(s.begin(), s.end(), re, -1);
      boost::sregex_token_iterator j;

      unsigned count = 0;
      while(i != j)
      {
         cout << *i++ << endl;
         count++;
      }
      cout << "There were " << count << " tokens found." << endl;

   }while(argc == 1);
   return 0;
}

If the program is started without arguments and hello world is entered at the prompt, the output is:

hello
world
There were 2 tokens found.

Changing boost::regex re("\\s+"); into boost::regex re("\",\""); would split on the quoted delimiter "," instead. Entering hello","world at the prompt would also result in:

hello
world
There were 2 tokens found.
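
A token iterator can also be asked for the matches themselves rather than the text between them, which is one way to keep a quoted section together. The sketch below is an illustration only. It uses std::regex from C++11 instead of boost::regex so that it is self-contained (the token iterator interface is the same), and the helper name match_fields and the pattern are assumptions, not taken from the Boost documentation. It keeps the quote characters, skips empty fields, and does not handle escaped quotes.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect fields by matching them, instead of splitting
// on delimiters. A field is either a double-quoted run or a run of non-commas.
std::vector<std::string> match_fields(const std::string& line)
{
    // Group 1 captures the field; leading whitespace is skipped by \s*.
    static const std::regex field(R"re(\s*("[^"]*"|[^,]+))re");

    // Passing 1 (rather than -1) asks the iterator for sub-match 1 of each match.
    std::sregex_token_iterator first(line.begin(), line.end(), field, 1);
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

With the string from the question, this should return six fields, the last of which still carries its surrounding quotes.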

But I suspect you want to deal with input like "hello", "world", in which case one solution is to:

  1. split on commas only
  2. then remove the quotes (possibly using boost/algorithm/string/trim.hpp or the regex library).

EDIT: added program output

Licensed under: CC-BY-SA with attribution