Boost Spirit Qi track line and parse unicode

Question

Update: Demo added, Live on Coliru

I see the same problem when you try to wrap iterator_type in a line_pos_iterator.

After some thinking, I don't quite know what causes it (it might be possible to get around this by wrapping the u8_to_u32 converting iterator adapter inside a boost::spirit::multi_pass<> iterator adapter, but... that sounded so unwieldy I haven't even tried).

Instead, I think that the nature of line-breaking is that it is (mostly?) charset agnostic. So you could wrap the source iterator with line_pos_iterator first, before the encoding conversion.

This does compile. Of course, then you'll get position information in terms of the source iterators, not 'logical characters'^[1].
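To make the difference between the two kinds of position concrete, here is a minimal sketch in plain standard C++ (no Boost). It assumes the line is valid UTF-8 and that the byte offset falls on a code point boundary; code_points_before is a hypothetical helper name, not something Spirit provides.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the UTF-8 code points that start within the
// first `byte_offset` bytes of `line`.
// Assumption: `line` is valid UTF-8 and `byte_offset` lies on a code point boundary.
std::size_t code_points_before(const std::string& line, std::size_t byte_offset) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_offset && i < line.size(); ++i) {
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if ((static_cast<unsigned char>(line[i]) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the line "Hallo äöüß" (14 bytes in UTF-8), byte offsets 6 and 14 map to code point offsets 6 and 10: the source iterator sees 8 bytes for 'äöüß', while the decoded view sees 4 positions.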

Let me show a demonstration below. It parses space separated words into a vector of strings. The simplest way to show the position information was to use a vector of iterator_ranges instead of just strings. I used qi::raw[] to expose the iterators^[2].

So after a successful parse I loop through the matched ranges and print their location information. First, I print the actual positions reported from line_pos_iterators. Remember these are 'raw' byte offsets, since the source iterator is byte-oriented.

Next, I do a little dance with get_current_line and the u8_to_u32 conversion to translate the offset within the line to a (more) logical count. You'll see that the range for e.g.

Note: I currently assume that ranges do not cross line boundaries (that is true for this grammar). Otherwise one would need to extract and convert two lines. The way I'm doing that now is rather expensive, since the line is located and re-converted for every range. Consider optimizing by e.g. using Boost String Algorithm's find_all facilities: you can build a list of line-ends and use std::lower_bound to locate the current line slightly more efficiently.
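A rough sketch of that line-end index, assuming plain byte offsets into the source string rather than iterators. I swapped find_all for a simple std::string::find loop to keep it Boost-free; collect_line_ends, locate_line and LineInfo are made-up names for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Byte offsets of every '\n' in `text`, in ascending order (built once per input).
std::vector<std::size_t> collect_line_ends(const std::string& text) {
    std::vector<std::size_t> ends;
    for (std::size_t pos = text.find('\n'); pos != std::string::npos;
         pos = text.find('\n', pos + 1))
        ends.push_back(pos);
    return ends;
}

// 1-based line number and the byte offset at which that line starts.
struct LineInfo {
    std::size_t line;
    std::size_t start;
};

// Binary search: the first line-end at or after `offset` terminates the line
// containing `offset` (a '\n' itself is counted as part of the line it ends).
LineInfo locate_line(const std::vector<std::size_t>& line_ends, std::size_t offset) {
    auto it = std::lower_bound(line_ends.begin(), line_ends.end(), offset);
    std::size_t index = static_cast<std::size_t>(it - line_ends.begin());
    std::size_t start = (index == 0) ? 0 : line_ends[index - 1] + 1;
    return {index + 1, start};
}
```

The index is built in one pass, after which each lookup is O(log n) instead of a linear scan back to the start of the input. The byte column within the line is then just offset - start (plus one if you want it 1-based).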

Note: There might be issues with the implementations of get_line_start and get_current_line; if you notice anything like this, there's a 10-line patch over at the [spirit-general] user list that you could try.

Without further ado, the code and the output:

#define BOOST_SPIRIT_USE_PHOENIX_V3
#define BOOST_SPIRIT_UNICODE
#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/phoenix/function/adapt_function.hpp>

namespace phx = boost::phoenix;

#include <boost/spirit/include/qi.hpp>

namespace qi       = boost::spirit::qi;
namespace encoding = boost::spirit::unicode;

#include <boost/spirit/repository/include/qi_iter_pos.hpp>
#include <boost/spirit/include/support_line_pos_iterator.hpp>

#include <iostream>
#include <string>
#include <vector>

//==============================================================================
std::string to_utf8(const std::u32string& input) {
  return std::string(
      boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.begin()),
      boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.end()));
}

BOOST_PHOENIX_ADAPT_FUNCTION(std::string, to_utf8_, to_utf8, 1)

//==============================================================================
int main() {
    std::string input(u8"Hallo äöüß\n¡Bye! ✿➂➿♫");

    typedef boost::spirit::line_pos_iterator<std::string::const_iterator> source_iterator;

    typedef boost::u8_to_u32_iterator<source_iterator> iterator_type;

    source_iterator soi(input.begin()), 
                    eoi(input.end());
    iterator_type   first(soi), 
                    last(eoi);

    qi::rule<iterator_type, std::u32string()> string_u32 = +encoding::graph;
    qi::rule<iterator_type, std::string()>    string     = string_u32 [qi::_val = to_utf8_(qi::_1)];

    std::vector<boost::iterator_range<iterator_type> > ast;
    // note the trick with `raw` to expose the iterators
    bool result = qi::phrase_parse(first, last, *qi::raw[ string ], encoding::space, ast);

    if (result) {
        for (auto const& range : ast)
        {
            source_iterator 
                base_b(range.begin().base()), 
                base_e(range.end().base());
            auto lbound = get_line_start(soi, base_b);

            // RAW access to the base iterators:
            std::cout << "Fragment: '" << std::string(base_b, base_e) << "'\t" 
                << "raw: L" << get_line(base_b) << ":" << get_column(lbound, base_b, /*tabs:*/4)
                <<     "-L" << get_line(base_e) << ":" << get_column(lbound, base_e, /*tabs:*/4);

            // "cooked" access:
            auto line = get_current_line(lbound, base_b, eoi);
            // std::cout << "Line: '" << line << "'\n";

            // iterator_type is an alias for u8_to_u32_iterator<...>
            size_t cur_pos = 0, start_pos = 0, end_pos = 0;
            for(iterator_type it = line.begin(), _eol = line.end(); ; ++it, ++cur_pos)
            {
                if (it.base() == base_b) start_pos = cur_pos;
                if (it.base() == base_e) end_pos   = cur_pos;

                if (it == _eol)
                    break;
            }
            std::cout << "\t// in u32 code _units_: positions " << start_pos << "-" << end_pos << "\n";
        }
        std::cout << "\n";
    } else {
        std::cout << "Failure" << std::endl;
    }

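    if (first!=last)
    {
        // print the unparsed rest via the byte-oriented base iterators, so it stays valid UTF-8
        // (copying the u32 code points straight into a std::string would truncate them to char)
        std::cout << "Remaining: '" << std::string(first.base(), last.base()) << "'\n";
    }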
}

The output:

clang++ -std=c++11 -Os main.cpp && ./a.out
Fragment: 'Hallo'   raw: L1:1-L1:6  // in u32 code _units_: positions 0-5
Fragment: 'äöüß'    raw: L1:7-L1:15 // in u32 code _units_: positions 6-10
Fragment: '¡Bye!'   raw: L2:2-L2:8  // in u32 code _units_: positions 1-6
Fragment: '✿➂➿♫'    raw: L2:9-L2:21 // in u32 code _units_: positions 7-11

^[1] I don't think there is a single useful definition of what a character is in this context. There are bytes, code units, code points, grapheme clusters, possibly more. Suffice it to say that the source iterator (std::string::const_iterator) deals with bytes (since it is charset/encoding unaware). In a u32string each position is one code point, but that is still not quite a 'character' (I think (?) that for >L2 UNICODE support you would still have to handle grapheme clusters combined from multiple code points).

^[2] This means that the attribute conversion and the semantic action are currently redundant, but you'll get that :)