How can I standardize Japanese so I can do a word check for prohibited words in Lua?

https://stackoverflow.com/questions/15088255

11-03-2022

Question

There are too many combinations of half-width, full-width, katakana, hiragana, and kanji, plus substitute characters (e.g. そ instead of ん).

Python has a package called jcconv which would help me do what I need to do. I want to convert strings into a standard form so I can go down my restricted word list.

Is this possible in Lua?

Solution

To convert strings between hiragana, katakana, and half-width katakana, you could store each alphabet's characters in its own table and add a mapping between them (either by index or by key).

This is also how jcconv does it, judging by its source.
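The table-per-alphabet idea can be sketched roughly as follows. This is only an illustration with the five vowels: the names hiragana, katakana, half_kana and build_map are made up for this example and are not taken from jcconv or any library.

```lua
-- Parallel tables: the same index holds the same sound in each alphabet.
-- Only the five vowels are listed here; a real table would cover every kana.
local hiragana  = { "あ", "い", "う", "え", "お" }
local katakana  = { "ア", "イ", "ウ", "エ", "オ" }
local half_kana = { "ｱ", "ｲ", "ｳ", "ｴ", "ｵ" }

-- Hypothetical helper: build a [from] = to lookup table by pairing
-- the entries that share an index.
local function build_map(from, to)
  local map = {}
  for i, ch in ipairs(from) do
    map[ch] = to[i]
  end
  return map
end

local half_to_full = build_map(half_kana, katakana)
local hira_to_kata = build_map(hiragana, katakana)

print(half_to_full["ｱ"])  -- full-width katakana for half-width ｱ
```

One thing to keep in mind: voiced half-width kana such as ｶﾞ are written as two characters (the base kana plus a separate voicing mark), so a one-character-to-one-character table is not enough for them and they would need multi-character keys.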

For example, to convert hiragana to katakana you could do the following:

1. Set up a table where each element is defined as [hiragana] = katakana.
2. Iterate over the string character by character and substitute each character that has an entry in the table. (I found a small library that does exactly this: utf8.lua provides a substitution function that accepts a mapping table.)
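The two steps above could look something like this in plain Lua. This is a minimal sketch under a few assumptions: the input is valid UTF-8, the table is generated from the fact that the Unicode katakana block sits 0x60 code points above the hiragana block, and the standard string.gsub is used in place of utf8.lua. The names utf8_3byte, to_katakana, restricted and contains_restricted are invented for the example, and the word list is a placeholder.

```lua
-- Encode a code point in the range U+0800..U+FFFF as a 3-byte UTF-8 string.
local function utf8_3byte(cp)
  return string.char(
    0xE0 + math.floor(cp / 4096),
    0x80 + math.floor(cp / 64) % 64,
    0x80 + cp % 64)
end

-- Step 1: build the [hiragana] = katakana table.
-- Hiragana U+3041..U+3096 maps onto katakana U+30A1..U+30F6 (offset 0x60).
local hira_to_kata = {}
for cp = 0x3041, 0x3096 do
  hira_to_kata[utf8_3byte(cp)] = utf8_3byte(cp + 0x60)
end

-- Step 2: walk the string one UTF-8 character at a time and substitute.
-- The pattern matches one lead byte followed by its continuation bytes;
-- gsub leaves a character unchanged when the table has no entry for it.
local function to_katakana(s)
  return (s:gsub("[^\128-\191][\128-\191]*", hira_to_kata))
end

-- Hypothetical restricted-word list, stored in the normalized (katakana) form.
local restricted = { "テスト" }

local function contains_restricted(s)
  local normalized = to_katakana(s)
  for _, word in ipairs(restricted) do
    if normalized:find(word, 1, true) then  -- plain find, no pattern matching
      return true
    end
  end
  return false
end

print(to_katakana("ひらがな"))
print(contains_restricted("これはてすとです"))
```

This only covers the hiragana to katakana step. Half-width forms and kanji readings would need their own tables or a separate pass before the word check.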

Licensed under: CC-BY-SA with attribution
