Question

I'm using a .NET port of MeCab (called NMeCab) to try to convert Japanese hiragana, katakana, and kanji to romaji.

Here's my code:

using NMeCab;

MeCabTagger _tagger;

public string Parse(string input)
{
    _tagger = MeCabTagger.Create();
    _tagger.OutPutFormatType = "lattice";
    _tagger.LatticeLevel = MeCabLatticeLevel.Two;

    var output = _tagger.Parse(input);

    return output;
}

When I call Parse(input) using the following Japanese text: "ども"

I get the output: "ども助詞,接続助詞,,,,,ども,ドモ,ドモ EOS"

I'm looking for the romaji of "ども", which would be "domo."

I've also tried using MeCab directly, as discussed in this SO answer, but I get the same output.


Solution

To my knowledge, none of the dictionaries commonly used with MeCab (IPAdic, JumanDic, or UniDic) includes a romaji transcription of words. There is also no real need for one:

  1. Several transcription schemes exist (e.g. Hepburn, Kunrei-shiki, 99-shiki), so no single romaji form would suit everyone;

  2. The pronunciation of each lexical unit is already available in katakana (e.g. ドモ in the output above).
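
The katakana reading from point 2 can be pulled out of the tagger output with plain string handling. Here is a minimal sketch, assuming the IPAdic feature layout (reading in the 8th comma-separated field) and the usual "surface, tab, features" line format. `MeCabReading` and `ExtractReading` are hypothetical names for illustration, not part of NMeCab, and other dictionaries may order their fields differently.

```csharp
using System;

public static class MeCabReading
{
    // Assumption: IPAdic layout, where index 7 holds the katakana reading.
    private const int ReadingIndex = 7;

    // Takes one line of MeCab output ("surface<TAB>feature,feature,...")
    // and returns the katakana reading, or null when there is none.
    public static string ExtractReading(string line)
    {
        if (string.IsNullOrEmpty(line) || line == "EOS") return null;

        // Drop the surface form if present; otherwise treat the whole line as features.
        int tab = line.IndexOf('\t');
        string features = tab >= 0 ? line.Substring(tab + 1) : line;
        string[] fields = features.Split(',');

        // Unknown words typically carry fewer fields and have no reading.
        if (fields.Length <= ReadingIndex) return null;

        string reading = fields[ReadingIndex];
        return (reading.Length == 0 || reading == "*") ? null : reading;
    }
}
```

For the feature string in the question, `MeCabReading.ExtractReading("ども\t助詞,接続助詞,*,*,*,*,ども,ドモ,ドモ")` returns "ドモ".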

You have to write your own transcription routine, or look for an existing katakana-to-romaji module that supports the transcription scheme you want.
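
As a rough idea of what such a routine could look like, here is a minimal sketch of a katakana-to-romaji converter using a Hepburn-style mapping. `KatakanaRomanizer` and `ToHepburn` are hypothetical names, and the rules are deliberately simplified: long vowels (ー) are written by repeating the vowel instead of using a macron, ン is always "n", and extended combinations such as ファ or ティ are not handled.

```csharp
using System;
using System.Text;

public static class KatakanaRomanizer
{
    // Basic and voiced katakana, in the same order as the Romaji table below.
    private const string Kana =
        "アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲン" +
        "ガギグゲゴザジズゼゾダヂヅデドバビブベボパピプペポ";

    private static readonly string[] Romaji = (
        "a i u e o ka ki ku ke ko sa shi su se so ta chi tsu te to na ni nu ne no " +
        "ha hi fu he ho ma mi mu me mo ya yu yo ra ri ru re ro wa o n " +
        "ga gi gu ge go za ji zu ze zo da ji zu de do ba bi bu be bo pa pi pu pe po").Split(' ');

    public static string ToHepburn(string katakana)
    {
        var sb = new StringBuilder();
        bool doubleNext = false; // set when a small ッ was seen

        for (int i = 0; i < katakana.Length; i++)
        {
            char c = katakana[i];

            if (c == 'ッ') { doubleNext = true; continue; }

            if (c == 'ー')
            {
                // Simplification: lengthen by repeating the previous vowel.
                if (sb.Length > 0 && "aiueo".IndexOf(sb[sb.Length - 1]) >= 0)
                    sb.Append(sb[sb.Length - 1]);
                continue;
            }

            int index = Kana.IndexOf(c);
            if (index < 0)
            {
                // Not a character we know: pass it through unchanged.
                sb.Append(c);
                doubleNext = false;
                continue;
            }

            string syllable = Romaji[index];

            // Digraphs: an i-column kana followed by small ャ/ュ/ョ (キャ -> kya, シャ -> sha).
            if (i + 1 < katakana.Length && syllable.Length > 1
                && syllable[syllable.Length - 1] == 'i')
            {
                int small = "ャュョ".IndexOf(katakana[i + 1]);
                if (small >= 0)
                {
                    string stem = syllable.Substring(0, syllable.Length - 1);
                    bool palatal = stem == "sh" || stem == "ch" || stem == "j";
                    syllable = stem + (palatal ? "" : "y") + "auo"[small];
                    i++; // consume the small kana
                }
            }

            if (doubleNext)
            {
                // Small ッ doubles the next consonant; Hepburn writes "tch" before "ch".
                sb.Append(syllable.StartsWith("ch", StringComparison.Ordinal) ? 't' : syllable[0]);
                doubleNext = false;
            }

            sb.Append(syllable);
        }

        return sb.ToString();
    }
}
```

With this sketch, `KatakanaRomanizer.ToHepburn("ドモ")` returns "domo". A different scheme such as Kunrei-shiki would mostly need a different `Romaji` table and digraph rule.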

Licensed under: CC-BY-SA with attribution