Using escaped_list_separator with boost split

https://stackoverflow.com/questions/890895

23-08-2019

Question

I am using the boost string library and have just discovered how simple the split method is.

  string delimiters = ",";
  string str = "string, with, comma, delimited, tokens, \"and delimiters, inside a quote\"";
  // If we didn't care about delimiter characters within a quoted section we could use
  vector<string> tokens;  
  boost::split(tokens, str, boost::is_any_of(delimiters));
  // gives the wrong result: tokens = {"string", " with", " comma", " delimited", " tokens", "\"and delimiters", " inside a quote\""}

That would be nice and concise, but it does not seem to work with quotes, so instead I have to do something like the following:

string delimiters = ",";
string str = "string, with, comma, delimited, tokens, \"and delimiters, inside a quote\"";
vector<string> tokens; 
escaped_list_separator<char> separator("\\",delimiters, "\"");
typedef tokenizer<escaped_list_separator<char> > Tokeniser;
Tokeniser t(str, separator);
for (Tokeniser::iterator it = t.begin(); it != t.end(); ++it)
    tokens.push_back(*it);
// gives the correct result (the quote characters themselves are stripped):
// tokens = {"string", " with", " comma", " delimited", " tokens", " and delimiters, inside a quote"}

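My question is: can split, or some other standard algorithm, be used when the delimiters may appear inside quotes? Thanks to purpledog, but I already have a non-deprecated way of achieving the desired result. I just find it quite cumbersome, and unless I can replace it with a simpler, more elegant solution I generally would not use it without first wrapping it in yet another method.

Edit: updated the code to show the results and clarify the question.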

Solution

There does not seem to be any simple way to do this with the boost::split method. The shortest piece of code I can find to do this is:

vector<string> tokens; 
tokenizer<escaped_list_separator<char> > t(str, escaped_list_separator<char>("\\", ",", "\""));
BOOST_FOREACH(string s, t)
    tokens.push_back(s);

which is only slightly more verbose than the original snippet:

vector<string> tokens;  
boost::split(tokens, str, boost::is_any_of(","));

Other tips

This achieves the same result as Jamie Cook's answer without an explicit loop.

tokenizer<escaped_list_separator<char> >tok(str);
vector<string> tokens( tok.begin(), tok.end() );

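The second parameter of the tokenizer constructor defaults to escaped_list_separator<char>("\\", ",", "\""), so it can be omitted unless you need something other than a comma as separator or a double quote as quote character.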
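To make the three roles in that default separator concrete (escape character, field separator, quote character), here is a minimal standard-library sketch of quote-aware splitting. It is not Boost's implementation: `split_quoted` is a hypothetical helper, and details such as escape sequences and empty input are simplified and may differ from what escaped_list_separator does.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost): splits `input` on `sep`, treats
// text between `quote` characters as literal, and lets `esc` protect the
// character that follows it.
std::vector<std::string> split_quoted(const std::string& input,
                                      char esc = '\\',
                                      char sep = ',',
                                      char quote = '"') {
    std::vector<std::string> tokens;
    std::string current;
    bool in_quote = false;
    for (std::string::size_type i = 0; i < input.size(); ++i) {
        const char c = input[i];
        if (c == esc && i + 1 < input.size()) {
            current += input[++i];      // simplification: keep the escaped character as-is
        } else if (c == quote) {
            in_quote = !in_quote;       // quote characters are dropped from the token
        } else if (c == sep && !in_quote) {
            tokens.push_back(current);  // an unquoted separator ends the current token
            current.clear();
        } else {
            current += c;
        }
    }
    tokens.push_back(current);          // last field (this sketch yields "" for empty input)
    return tokens;
}
```

Called on the string from the question, the last token comes back as ` and delimiters, inside a quote`: the comma inside the quotes is kept, and the quote characters themselves are dropped.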

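I don't know the boost::string library, but with boost regex_token_iterator you can express the delimiter as a regular expression. So yes, you can use quoted delimiters, and much more complex things as well.

Note that this used to be done with regex_split, which is now deprecated.

Here is an example from the boost documentation: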

#include <iostream>
#include <string>
#include <boost/regex.hpp>

using namespace std;

int main(int argc, char*[])
{
   string s;
   do{
      if(argc == 1)
      {
         cout << "Enter text to split (or \"quit\" to exit): ";
         getline(cin, s);
         if(s == "quit") break;
      }
      else
         s = "This is a string of tokens";

      boost::regex re("\\s+");
      boost::sregex_token_iterator i(s.begin(), s.end(), re, -1);
      boost::sregex_token_iterator j;

      unsigned count = 0;
      while(i != j)
      {
         cout << *i++ << endl;
         count++;
      }
      cout << "There were " << count << " tokens found." << endl;

   }while(argc == 1);
   return 0;
}

If the program is run without arguments and hello world is entered at the prompt, the output is:

hello
world
There were 2 tokens found.

Changing boost::regex re("\\s+"); to boost::regex re("\",\""); splits on a quoted delimiter instead. Entering hello","world at the prompt then also gives:

hello
world
There were 2 tokens found.
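The same delimiter-as-regex idea can be tried with the standard library. The sketch below assumes a C++11 compiler and uses `std::regex` with `std::sregex_token_iterator` rather than Boost.Regex; `regex_split` is a hypothetical helper name. As in the example above, passing -1 selects the text between matches of the delimiter pattern.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the pieces of `s` that lie between matches
// of the regular expression `pattern` (std::regex, not boost::regex).
std::vector<std::string> regex_split(const std::string& s,
                                     const std::string& pattern) {
    const std::regex re(pattern);
    // Submatch index -1 means "the parts that did NOT match", i.e. the tokens.
    std::sregex_token_iterator first(s.begin(), s.end(), re, -1);
    std::sregex_token_iterator last;  // default-constructed end-of-sequence iterator
    return std::vector<std::string>(first, last);
}
```

For instance, `regex_split("hello world", "\\s+")` and `regex_split("hello\",\"world", "\",\"")` both yield hello and world. With fully quoted input such as `"hello","world"`, the outer quotes stay attached to the first and last tokens.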

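But I suspect you want to handle input like "hello", "world", in which case one solution is to:

split on the comma only,
then remove the quotes (possibly using boost/algorithm/string/trim.hpp or the regex library).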
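Those two steps can be sketched with the standard library alone. `trim_chars` and `split_and_strip` are hypothetical helpers written for illustration; with Boost, `boost::split` and `boost::trim_if` with `is_any_of` would presumably fill the same roles. Note that this approach still breaks on a comma inside a quoted field, so it only suits input where quotes never contain the separator.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: strips leading and trailing characters listed in `chars`.
std::string trim_chars(const std::string& s, const std::string& chars) {
    const std::string::size_type first = s.find_first_not_of(chars);
    if (first == std::string::npos) return std::string();  // nothing but trim characters
    const std::string::size_type last = s.find_last_not_of(chars);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: step 1 splits on every comma, step 2 strips
// surrounding spaces and double quotes from each piece.
std::vector<std::string> split_and_strip(const std::string& s) {
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type comma = s.find(',', start);
        const std::string field = (comma == std::string::npos)
            ? s.substr(start)
            : s.substr(start, comma - start);
        tokens.push_back(trim_chars(field, " \""));
        if (comma == std::string::npos) break;  // last field handled
        start = comma + 1;
    }
    return tokens;
}
```

Calling `split_and_strip("\"hello\", \"world\"")` gives the two tokens hello and world.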

Edit: added program output.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow