Question

I'm writing a compiler for my own programming language, and I want to allow identifiers to contain any character from the Unicode letter categories (modern languages such as Go already allow this). I've read a lot about character encoding in C++11, and based on everything I've found, UTF-32 should work well: it is fast to iterate over in a lexer and has better support in C++ than UTF-8.

C++ has the isalpha function. How can I test whether a char32_t is a letter (a Unicode code point classified as a "letter" in any language)?

Is it even possible?

Solution

Use ICU to iterate over the string and check whether the appropriate Unicode properties hold. Here is an example in C that checks whether the UTF-8 command-line argument is a valid identifier:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#include <unicode/uchar.h>
#include <unicode/utf8.h>

int main(int argc, char **argv) {
  if (argc != 2) return EXIT_FAILURE;
  const char *const str = argv[1];
  int32_t off = 0;
  // U8_NEXT has a bug causing length < 0 to not work for characters in
  // [U+0080, U+07FF], so pass an explicit length instead of relying on
  // NUL-terminated scanning.
  const size_t actual_len = strlen(str);
  if (actual_len > INT32_MAX) return EXIT_FAILURE;
  const int32_t len = (int32_t)actual_len;
  if (!len) return EXIT_FAILURE;
  UChar32 ch = -1;
  // Decode the first code point; it must be well-formed UTF-8 and an
  // identifier-start character.
  U8_NEXT(str, off, len, ch);
  if (ch < 0 || !u_isIDStart(ch)) return EXIT_FAILURE;
  // Every remaining code point must be well-formed and an identifier part.
  while (off < len) {
    U8_NEXT(str, off, len, ch);
    if (ch < 0 || !u_isIDPart(ch)) return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}

Note that ICU here uses the Java definitions, which are slightly different from those in UAX #31. In a real application you might also want to normalize identifiers to NFC first.

OTHER TIPS

There is a u_isalpha function in the ICU project. I think you can use that.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow