overriding ctype<wchar_t>

https://stackoverflow.com/questions/2339593

22-09-2019

Question

I'm writing a lambda calculus interpreter for fun and practice. I got iostreams to properly tokenize identifiers by adding a ctype facet which defines punctuation as whitespace:

struct token_ctype : ctype<char> {
 mask t[ table_size ];
 token_ctype()
 : ctype<char>( t ) {
  for ( size_t tx = 0; tx < table_size; ++ tx ) {
   t[tx] = isalnum( tx )? alnum : space;
  }
 }
};

(classic_table() would probably be cleaner but that doesn't work on OS X!)

And then swap the facet in when I hit an identifier:

locale token_loc( in.getloc(), new token_ctype );
…
locale const &oldloc = in.imbue( token_loc );
in.unget() >> token;
in.imbue( oldloc );
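For readers who want to try the narrow-character version in isolation, here is a minimal self-contained sketch of the same idea. The helper name read_token is hypothetical, and the unget() step from the snippet above is left out because this helper does not consume a character beforehand.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <locale>
#include <sstream>
#include <string>

// Same facet as above: every non-alphanumeric character counts as whitespace.
struct token_ctype : std::ctype<char> {
    mask t[table_size];
    token_ctype() : std::ctype<char>(t) {
        // The base class only stores the pointer, so filling the table
        // after the base constructor has run is fine.
        for (std::size_t tx = 0; tx < table_size; ++tx)
            t[tx] = std::isalnum(static_cast<int>(tx)) ? alnum : space;
    }
};

// Hypothetical helper: swap the facet in, extract one token, restore the locale.
std::string read_token(std::istream& in) {
    std::locale token_loc(in.getloc(), new token_ctype); // the locale owns the facet
    std::locale old_loc = in.imbue(token_loc);
    std::string token;
    in >> token;
    in.imbue(old_loc);
    return token;
}
```

Called repeatedly on a stream holding "foo.bar(baz)", this should yield foo, bar and baz in turn.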

There seems to be surprisingly little lambda calculus code on the Web. Most of what I've found so far is full of Unicode λ characters. So I thought I'd try adding Unicode support.

But ctype<wchar_t> works completely differently from ctype<char>. There is no master table; instead there are four virtual methods: two overloads of do_is, plus do_scan_is and do_scan_not. So I did this:

struct token_ctype : ctype< wchar_t > {
 typedef ctype<wchar_t> base;

 bool do_is( mask m, char_type c ) const {
  return base::do_is(m,c)
  || (m&space) && ( base::do_is(punct,c) || c == L'λ' );
 }

 const char_type* do_is
  (const char_type* lo, const char_type* hi, mask* vec) const {
  base::do_is(lo,hi,vec);
  for ( mask *vp = vec; lo != hi; ++ vp, ++ lo ) {
   if ( *vp & punct || *lo == L'λ' ) *vp |= space;
  }
  return hi;
 }

 const char_type *do_scan_is
  (mask m, const char_type* lo, const char_type* hi) const {
  if ( m & space ) m |= punct;
  hi = base::do_scan_is(m,lo,hi); // call the base class; an unqualified call would recurse forever
  if ( m & space ) hi = find( lo, hi, L'λ' );
  return hi;
 }

 const char_type *do_scan_not
  (mask m, const char_type* lo, const char_type* hi) const {
  if ( m & space ) {
   m |= punct;
   while ( ( lo = base::do_scan_not(m,lo,hi) ) != hi && *lo == L'λ' ) // test for the end before dereferencing
    ++ lo;
   return lo;
  }
  return base::do_scan_not(m,lo,hi);
 }
};

(Apologies for the flat formatting; the preview converted the tabs differently.)

The code is WAY less elegant. It does better express the notion that only punctuation is additional whitespace, but that would've been fine in the original had I had classic_table.

Is there a simpler way to do this? Do I really need all those overloads? (Testing showed do_scan_not is extraneous here, but I'm thinking more broadly.) Am I abusing facets in the first place? Is the above even correct? Would it be better style to implement less logic?

Solution

(It's been a year with no substantive answer, and I've learned a lot about iostreams in the meantime…)

The custom facet exists exclusively to serve the string extraction operator in >> token. That operator is specified in terms of use_facet< ctype< wchar_t > >( in.getloc() ).is( ctype_base::space, c ) "for the next available input character c" (§21.3.7.9). ctype::is simply forwards to ctype::do_is, so it would seem that overriding the single-character do_is is sufficient.

Nevertheless, recent versions of the GCC standard library do implement operator>> in terms of scan_is. The catch is that do_scan_is is then implemented as a series of calls to do_is, virtual dispatch and all. The header file describes do_scan_is as a hook for user optimization.

So, it would seem that the as-if rule shelters an implementation that only provides the first override.
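As a rough illustration of that conclusion, the sketch below overrides only the single-character do_is and lets operator>> do the rest. The names punct_is_space and split_identifiers are hypothetical, and whether extraction reaches do_is directly or through scan_is depends on the library, as described above.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Only the first override: punctuation and U+03BB (lambda) also count as whitespace.
struct punct_is_space : std::ctype<wchar_t> {
    bool do_is(mask m, char_type c) const override {
        typedef std::ctype<wchar_t> base;
        return base::do_is(m, c)
            || ((m & space) && (base::do_is(punct, c) || c == L'\u03BB'));
    }
};

// Hypothetical helper: split text into identifiers using nothing but operator>>.
std::vector<std::wstring> split_identifiers(const std::wstring& text) {
    std::wistringstream in(text);
    in.imbue(std::locale(in.getloc(), new punct_is_space)); // the locale owns the facet
    std::vector<std::wstring> out;
    for (std::wstring tok; in >> tok; )
        out.push_back(tok);
    return out;
}
```

Here split_identifiers(L"foo.bar(baz)") is expected to return the three identifiers foo, bar and baz.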

Note that the second override, which retrieves mask values, is the odd one out. It could be implemented in terms of the first, by inefficiently building the mask bit by bit. In GCC it is implemented in terms of system calls, inefficiently building the mask bit by bit with 15 calls per character. This seems to sacrifice both performance and compatibility. Fortunately, it seems nobody uses it.
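To make "building the mask bit by bit" concrete, here is a hedged sketch of the second override written purely in terms of the first. The struct name bitwise_ctype and the list of category bits are my own choices, not taken from any particular library.

```cpp
#include <locale>

// Sketch: the range overload of do_is, expressed through the single-character test.
struct bitwise_ctype : std::ctype<wchar_t> {
    const char_type* do_is(const char_type* lo, const char_type* hi,
                           mask* vec) const override {
        // Elementary categories; alnum and graph are combinations of these.
        static const mask bits[] = { space, print, cntrl, upper, lower,
                                     alpha, digit, punct, xdigit };
        for (; lo != hi; ++lo, ++vec) {
            mask m = 0;
            for (mask bit : bits)
                if (is(bit, *lo)) // public wrapper; dispatches to the virtual do_is(mask, char_type)
                    m |= bit;
            *vec = m;
        }
        return hi;
    }
};
```

A facet that also overrides the single-character do_is would then get a consistent range version for free, at the cost of one virtual call per category per character.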

Anyway, this is all well and good, but simply writing a tokenizer using istreambuf_iterator<wchar_t> is easier, far more extensible, and simplifies exception handling.
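A minimal sketch of what such a tokenizer might look like follows. The function name tokenize and the token rules (alphanumeric runs are identifiers, any other non-space character is a one-character token) are assumptions made for illustration, not the interpreter's actual grammar.

```cpp
#include <cwctype>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical tokenizer: no facets, just a loop over the stream buffer.
std::vector<std::wstring> tokenize(std::wistream& in) {
    const wchar_t lambda = L'\u03BB'; // always a token of its own
    std::vector<std::wstring> tokens;
    std::istreambuf_iterator<wchar_t> it(in), end;
    while (it != end) {
        wchar_t c = *it;
        if (std::iswspace(c)) { ++it; continue; }
        std::wstring tok;
        if (c != lambda && std::iswalnum(c)) {
            // Identifier: consume the whole alphanumeric run.
            while (it != end && *it != lambda && std::iswalnum(*it)) {
                tok += *it;
                ++it;
            }
        } else {
            // Lambda or punctuation: a single-character token.
            tok += c;
            ++it;
        }
        tokens.push_back(tok);
    }
    return tokens;
}

// Convenience overload for in-memory text.
std::vector<std::wstring> tokenize(const std::wstring& text) {
    std::wistringstream in(text);
    return tokenize(in);
}
```

Unlike the facet approach, this keeps punctuation as tokens rather than discarding it, and adding new token kinds means adding a branch rather than touching locale machinery.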

OTHER TIPS

IMHO the code you posted is fine. You could implement some of the methods using others if you wanted simpler code (maybe at the expense of efficiency), but the way you did it is OK.

The disparity exists because a table-driven ctype<wchar_t> would need a table of several megabytes, which people don't want in their Unicode programs.

Licensed under: CC-BY-SA with attribution