Question

I want to trace input position and input line for unicode strings.

For the position I store an iterator to the beginning of the input and call std::distance at the desired position. That works well as long as the input is ASCII-only. With non-ASCII characters the position is shifted: for example, ä occupies two bytes in the UTF-8 input stream, so the position is off by 1. So I switched to boost::u8_to_u32_iterator, and this works fine.
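To illustrate the mismatch without Boost, here is a minimal sketch. The helper name code_point_offset is hypothetical (it is not part of Boost or the standard library), and it assumes the input is valid UTF-8 and that the offset falls on a code point boundary. It converts a byte offset, as std::distance would report it on a std::string, into a code point count:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts the UTF-8 code points that start within the
// first `byte_offset` bytes of `text`.
// Assumes `text` is valid UTF-8 and `byte_offset` is on a code point boundary.
std::size_t code_point_offset(const std::string& text, std::size_t byte_offset) {
    assert(byte_offset <= text.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_offset; ++i) {
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if ((static_cast<unsigned char>(text[i]) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the input "Hallo ä", the byte distance to the end is 8, while code_point_offset returns 7.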

For the line I use boost::spirit::line_pos_iterator which also works well.

My problem is in combining both concepts to use the line iterator and the unicode iterator. Another solution allowing pos and line on unicode strings is of course also welcome.

Here is a small example of the unicode parser. As mentioned, I would like to additionally wrap the iterator in boost::spirit::line_pos_iterator, but that doesn't even compile.

#define BOOST_SPIRIT_USE_PHOENIX_V3
#define BOOST_SPIRIT_UNICODE
#include <boost/regex/pending/unicode_iterator.hpp>

#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/phoenix.hpp>

namespace phx = boost::phoenix;

#include <boost/spirit/include/qi.hpp>

namespace qi = boost::spirit::qi;

#include <boost/spirit/repository/include/qi_iter_pos.hpp>
#include <boost/spirit/include/support_line_pos_iterator.hpp>

#include <iostream>
#include <string>

//==============================================================================
std::string to_utf8(const std::u32string& input) {
  return std::string(
      boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.begin()),
      boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.end()));
}

//==============================================================================
int main() {
  std::string input(u8"Hallo äöüß");

  typedef boost::u8_to_u32_iterator<std::string::const_iterator> iterator_type;

  iterator_type first(input.begin()), last(input.end());

  qi::rule<iterator_type, std::u32string()> string_u32 = *(qi::char_ - qi::eoi);

  qi::rule<iterator_type, std::string()> string =
      string_u32[qi::_val = phx::bind(&to_utf8, qi::_1)];

  qi::rule<iterator_type, std::string()> rule = string;

  std::string ast;
  bool result = qi::parse(first, last, rule, ast);
  if (result) {
    result = first == last;
  }

  if (result) {
    std::cout << "Parsed: " << ast << std::endl;
  } else {
    std::cout << "Failure" << std::endl;
  }
}

Solution

Update: demo added (Live on Coliru).

I see the same problem when you try to wrap iterator_type in a line_pos_iterator.

After some thinking, I don't quite know what causes it (it might be possible to get around this by wrapping the u8_to_u32 converting iterator adapter inside a boost::spirit::multi_pass<> iterator adapter, but... that sounded so unwieldy I haven't even tried).

Instead, I think that the nature of line-breaking is that it is (mostly?) charset agnostic. So you could wrap the source iterator with line_pos_iterator first, before the encoding conversion.

This does compile. Of course, then you'll get position information in terms of the source iterators, not 'logical characters'[1].

Let me show a demonstration below. It parses space separated words into a vector of strings. The simplest way to show the position information was to use a vector of iterator_ranges instead of just strings. I used qi::raw[] to expose the iterators[2].

So after a successful parse I loop through the matched ranges and print their location information. First, I print the actual positions reported from line_pos_iterators. Remember these are 'raw' byte offsets, since the source iterator is byte-oriented.

Next, I do a little dance with get_current_line and the u8_to_u32 conversion to translate the offset within the line to a (more) logical count. You'll see that the range for e.g.

Note: I currently assume that ranges do not cross line boundaries (that is true for this grammar). Otherwise one would need to extract and convert two lines. The way I'm doing that now is rather expensive. Consider optimizing by e.g. using Boost String Algorithm's find_all facilities. You can build a list of line-ends and use std::lower_bound to locate the current line slightly more efficiently.

Note: There might be issues with the implementations of get_line_start and get_current_line; if you notice anything like this, there's a 10-line patch over at the [spirit-general] user list that you could try.
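The line-end list idea can be sketched roughly as follows, using only the standard library. LineIndex and code_point_column are hypothetical names, not part of Boost.Spirit. The sketch assumes valid UTF-8, '\n' line endings, and offsets that fall on code point boundaries, and it works on plain byte offsets rather than on line_pos_iterator:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: remembers the byte offset of every '\n' in the text.
struct LineIndex {
    std::vector<std::size_t> line_ends;  // sorted by construction

    explicit LineIndex(const std::string& text) {
        for (std::size_t i = 0; i < text.size(); ++i) {
            if (text[i] == '\n') line_ends.push_back(i);
        }
    }

    // 1-based line number of the line containing `offset`
    // (a '\n' counts as part of the line it terminates).
    std::size_t line_of(std::size_t offset) const {
        auto it = std::lower_bound(line_ends.begin(), line_ends.end(), offset);
        return static_cast<std::size_t>(it - line_ends.begin()) + 1;
    }

    // Byte offset of the first byte of the line containing `offset`.
    std::size_t line_start(std::size_t offset) const {
        auto it = std::lower_bound(line_ends.begin(), line_ends.end(), offset);
        return it == line_ends.begin() ? 0 : *(it - 1) + 1;
    }
};

// 0-based column of `offset` within its line, counted in code points.
std::size_t code_point_column(const std::string& text, const LineIndex& index,
                              std::size_t offset) {
    assert(offset <= text.size());
    std::size_t column = 0;
    for (std::size_t i = index.line_start(offset); i < offset; ++i) {
        // Skip UTF-8 continuation bytes (10xxxxxx); count every other byte.
        if ((static_cast<unsigned char>(text[i]) & 0xC0) != 0x80) ++column;
    }
    return column;
}
```

The index is built once in linear time, after which each line lookup is a binary search. Only the bytes of the current line are scanned to compute the column.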

Without further ado, the code and the output:

#define BOOST_SPIRIT_USE_PHOENIX_V3
#define BOOST_SPIRIT_UNICODE
#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/phoenix/function/adapt_function.hpp>

namespace phx = boost::phoenix;

#include <boost/spirit/include/qi.hpp>

namespace qi       = boost::spirit::qi;
namespace encoding = boost::spirit::unicode;

#include <boost/spirit/repository/include/qi_iter_pos.hpp>
#include <boost/spirit/include/support_line_pos_iterator.hpp>

#include <iostream>
#include <string>
#include <vector>

//==============================================================================
std::string to_utf8(const std::u32string& input) {
  return std::string(
      boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.begin()),
      boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.end()));
}

BOOST_PHOENIX_ADAPT_FUNCTION(std::string, to_utf8_, to_utf8, 1)

//==============================================================================
int main() {
    std::string input(u8"Hallo äöüß\n¡Bye! ✿➂➿♫");

    typedef boost::spirit::line_pos_iterator<std::string::const_iterator> source_iterator;

    typedef boost::u8_to_u32_iterator<source_iterator> iterator_type;

    source_iterator soi(input.begin()), 
                    eoi(input.end());
    iterator_type   first(soi), 
                    last(eoi);

    qi::rule<iterator_type, std::u32string()> string_u32 = +encoding::graph;
    qi::rule<iterator_type, std::string()>    string     = string_u32 [qi::_val = to_utf8_(qi::_1)];

    std::vector<boost::iterator_range<iterator_type> > ast;
    // note the trick with `raw` to expose the iterators
    bool result = qi::phrase_parse(first, last, *qi::raw[ string ], encoding::space, ast);

    if (result) {
        for (auto const& range : ast)
        {
            source_iterator 
                base_b(range.begin().base()), 
                base_e(range.end().base());
            auto lbound = get_line_start(soi, base_b);

            // RAW access to the base iterators:
            std::cout << "Fragment: '" << std::string(base_b, base_e) << "'\t" 
                << "raw: L" << get_line(base_b) << ":" << get_column(lbound, base_b, /*tabs:*/4)
                <<     "-L" << get_line(base_e) << ":" << get_column(lbound, base_e, /*tabs:*/4);

            // "cooked" access:
            auto line = get_current_line(lbound, base_b, eoi);
            // std::cout << "Line: '" << line << "'\n";

            // iterator_type is an alias for u8_to_u32_iterator<...>
            size_t cur_pos = 0, start_pos = 0, end_pos = 0;
            for(iterator_type it = line.begin(), _eol = line.end(); ; ++it, ++cur_pos)
            {
                if (it.base() == base_b) start_pos = cur_pos;
                if (it.base() == base_e) end_pos   = cur_pos;

                if (it == _eol)
                    break;
            }
            std::cout << "\t// in u32 code _units_: positions " << start_pos << "-" << end_pos << "\n";
        }
        std::cout << "\n";
    } else {
        std::cout << "Failure" << std::endl;
    }

    if (first!=last)
    {
        // print the unparsed remainder as raw UTF-8 bytes via the base iterators
        std::cout << "Remaining: '" << std::string(first.base(), last.base()) << "'\n";
    }
}

The output:

clang++ -std=c++11 -Os main.cpp && ./a.out
Fragment: 'Hallo'   raw: L1:1-L1:6  // in u32 code _units_: positions 0-5
Fragment: 'äöüß'    raw: L1:7-L1:15 // in u32 code _units_: positions 6-10
Fragment: '¡Bye!'   raw: L2:2-L2:8  // in u32 code _units_: positions 1-6
Fragment: '✿➂➿♫'    raw: L2:9-L2:21 // in u32 code _units_: positions 7-11

[1] I don't think there is a single useful definition of what a character is in this context. There are bytes, code units, code points, grapheme clusters, possibly more. Suffice it to say that the source iterator (std::string::const_iterator) deals with bytes (since it is charset/encoding unaware). In a u32string a single position is one code point, because UTF-32 uses exactly one code unit per code point (although I think (?) that for >L2 UNICODE support you would still have to handle user-perceived characters combined from multiple code points).

[2] This means that currently the attribute conversion and the semantic action are redundant, but you'll get that :)
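As a small illustration of how these units differ, here is a sketch with two hypothetical helpers, utf8_width and utf8_size, that are not taken from Boost. They compute how many UTF-8 bytes a code point or a UTF-32 string occupies, assuming valid code points:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes one code point occupies in UTF-8.
std::size_t utf8_width(char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;  // e.g. U+00E4 (ä)
    if (cp < 0x10000) return 3;  // e.g. U+273F (✿)
    return 4;                    // supplementary planes
}

// Hypothetical helper: total UTF-8 byte count of a UTF-32 string.
std::size_t utf8_size(const std::u32string& s) {
    std::size_t bytes = 0;
    for (char32_t cp : s) bytes += utf8_width(cp);
    return bytes;
}
```

For example, U"e\u0301" (e followed by a combining acute accent) has 2 code points and 3 UTF-8 bytes, yet is displayed as one user-perceived character.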

Licensed under: CC-BY-SA with attribution