Detecting syllables in a word

https://stackoverflow.com/questions/405161

03-07-2019
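Question

I need to find a fairly efficient way to detect the syllables in a word. For example:

invisible -> in-vi-sib-le

There are some syllabification rules that could be used:

V CV VC CVC CCV CCCV CVCC

*where V is a vowel and C is a consonant. For example:

Pronunciation (5 syllables: Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)

I have tried a few methods: regular expressions (which help only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally a finite state automaton (which did not produce anything useful).

The purpose of my application is to create a dictionary of all the syllables in a given language. This dictionary will later be used for spell-checking applications (using Bayesian classifiers) and for text-to-speech synthesis.

I would appreciate tips on an alternative way to solve this problem besides the approaches I have already tried.

I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.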

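Solution

For the purposes of hyphenation, read about the TeX approach to this problem. In particular, see Frank Liang's thesis, Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate, and it is backed by a small exception dictionary for the cases where the algorithm does not work.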
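To make the idea concrete, here is a minimal Python sketch of Liang-style pattern matching. It is an illustration under assumptions, not TeX's implementation: the handful of patterns in `PATTERNS` is a tiny hand-picked subset chosen to cover the single word "hyphenation" (real pattern files contain thousands of entries), and the names `parse_pattern` and `hyphenate` are made up for this example. Digits in a pattern score the gap at that position; the highest score per gap wins, and odd scores allow a break.

```python
import re

# Illustrative subset only; a real pattern file is far larger.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]

def parse_pattern(pattern):
    """Split 'hy3ph' into its letters 'hyph' and per-gap scores [0, 0, 3, 0, 0]."""
    letters = re.sub(r"\d", "", pattern)
    points = [0] * (len(letters) + 1)
    seen = 0  # number of letters consumed so far
    for ch in pattern:
        if ch.isdigit():
            points[seen] = int(ch)
        else:
            seen += 1
    return letters, points

def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return the pieces of `word` between permitted break points."""
    table = dict(parse_pattern(p) for p in patterns)
    work = "." + word.lower() + "."          # dots mark the word boundaries
    scores = [0] * (len(work) + 1)           # scores[i] = gap before work[i]
    for start in range(len(work)):
        for end in range(start + 1, len(work) + 1):
            points = table.get(work[start:end])
            if points:
                for offset, value in enumerate(points):
                    scores[start + offset] = max(scores[start + offset], value)
    pieces, last = [], 0
    # Keep at least left_min letters before and right_min letters after a break.
    for j in range(left_min, len(word) - right_min + 1):
        if scores[j + 1] % 2 == 1:           # odd score: break allowed before word[j]
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return pieces

print("-".join(hyphenate("hyphenation", PATTERNS)))  # hy-phen-ation
```

Note that the result is a list of hyphenation points, not a full syllabification: "ation" stays in one piece, which is one reason hyphenation output only approximates syllables.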

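Other tips

I stumbled across this page looking for the same thing, and found a few implementations of the Liang paper here: https://github.com/mnater/hyphenator

That is, unless you are the type who enjoys reading a 60-page thesis instead of adapting freely available code for a non-unique problem. :)

Here is a solution using NLTK: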

from nltk.corpus import cmudict
d = cmudict.dict()
def nsyl(word):
  return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]]
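The one-liner above relies on a CMUdict convention: every vowel phoneme carries a stress digit (0, 1 or 2), so counting the phonemes that end in a digit counts the syllables. A minimal sketch of that idea without NLTK follows; the pronunciation in `HELLO` is typed by hand for illustration and is an assumption, not something read from the dictionary.

```python
def count_syllables(phonemes):
    """Count phonemes that end in a stress digit, i.e. the vowel sounds."""
    return sum(1 for phoneme in phonemes if phoneme[-1].isdigit())

# Hand-typed ARPAbet-style pronunciation (assumed for this example).
HELLO = ["HH", "AH0", "L", "OW1"]

print(count_syllables(HELLO))  # 2
```

`nsyl` returns a list because a word can have several pronunciations in the dictionary, and it raises `KeyError` for words the dictionary does not contain.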

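I am trying to tackle this problem for a program that will calculate the Flesch-Kincaid and Flesch reading scores of a block of text. My algorithm uses what I found on this website: http://www.howmanysyllables.com/howtocountsyllables.html and it gets reasonably close. It still has trouble with complicated words like "invisible" and "hyphenation", but I have found that it gets in the ballpark for my purposes.

It has the upside of being easy to implement. I found that a trailing "es" can be either syllabic or not. It is a gamble, but I decided to remove the "es" in my algorithm.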

private int CountSyllables(string word)
    {
        char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
        string currentWord = word;
        int numVowels = 0;
        bool lastWasVowel = false;
        foreach (char wc in currentWord)
        {
            bool foundVowel = false;
            foreach (char v in vowels)
            {
                //don't count diphthongs
                if (v == wc && lastWasVowel)
                {
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
                else if (v == wc && !lastWasVowel)
                {
                    numVowels++;
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
            }

            //if full cycle and no vowel found, set lastWasVowel to false;
            if (!foundVowel)
                lastWasVowel = false;
        }
        //remove es, it's _usually? silent
        if (currentWord.Length > 2 && 
            currentWord.Substring(currentWord.Length - 2) == "es")
            numVowels--;
        // remove silent e
        else if (currentWord.Length > 1 &&
            currentWord.Substring(currentWord.Length - 1) == "e")
            numVowels--;

        return numVowels;
    }

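This is a particularly difficult problem, and the LaTeX hyphenation algorithm does not solve it completely. A good summary of some of the available methods and the challenges involved can be found in the paper Evaluating Automatic Syllabification Algorithms for English (Marchand, Adsett, and Damper 2007).

Thanks, Joe Basirico, for sharing your quick and dirty implementation in C#. I have used the big libraries, and they work, but they are usually a bit slow; for quick projects, your method works fine.

Here is your code in Java, along with test cases: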

public static int countSyllables(String word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    char[] currentWord = word.toCharArray();
    int numVowels = 0;
    boolean lastWasVowel = false;
    for (char wc : currentWord) {
        boolean foundVowel = false;
        for (char v : vowels)
        {
            //don't count diphthongs
            if ((v == wc) && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // If full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    // Remove a trailing "es"; it is usually (not always) silent.
    // endsWith compares contents; == on Strings would compare references.
    if (word.length() > 2 && word.endsWith("es"))
        numVowels--;
    // Remove a silent trailing e
    else if (word.length() > 1 && word.endsWith("e"))
        numVowels--;
    return numVowels;
}

public static void main(String[] args) {
    String txt = "what";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "super";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Maryland";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "American";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "disenfranchized";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Sophia";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
}

The result was as expected (it works well enough for Flesch-Kincaid):

txt=what countSyllables=1
txt=super countSyllables=2
txt=Maryland countSyllables=3
txt=American countSyllables=3
txt=disenfranchized countSyllables=5
txt=Sophia countSyllables=2

Bumping @Tihamer and @joe-basirico. A very useful function; not perfect, but good for most small-to-medium projects. Joe, I have rewritten your code in Python:

def countSyllables(word):
    vowels = "aeiouy"
    numVowels = 0
    lastWasVowel = False
    for wc in word:
        foundVowel = False
        for v in vowels:
            if v == wc:
                if not lastWasVowel: numVowels+=1   #don't count diphthongs
                foundVowel = lastWasVowel = True
                break
        if not foundVowel:  #If full cycle and no vowel found, set lastWasVowel to false
            lastWasVowel = False
    if len(word) > 2 and word[-2:] == "es": #Remove es - it's "usually" silent (?)
        numVowels-=1
    elif len(word) > 1 and word[-1:] == "e":    #remove silent e
        numVowels-=1
    return numVowels

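Hope someone finds this useful!

Perl has the Lingua::Phonology::Syllable module. You might try that, or look into its algorithm. I saw a few other, older modules there too.

I don't understand why a regular expression gives you only a count of syllables. You should be able to get the syllables themselves using capture parentheses. Assuming you can construct a regular expression that works, that is.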
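As a rough illustration of that point, the Python sketch below uses `re.findall` to return the matched chunks themselves rather than a count. The pattern is an assumption made up for this example (leading consonants, a vowel group, then a consonant that is kept only when another consonant follows, or all trailing consonants at the end of the word); it is a heuristic, not a linguistically sound rule set.

```python
import re

# Heuristic chunking pattern (an assumption for illustration):
#   optional consonants + vowel group
#   + either all trailing consonants at the end of the word,
#     or one consonant if it is followed by another consonant.
SYLLABLE_RE = re.compile(
    r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]*$|[^aeiouy](?=[^aeiouy]))?"
)

def split_syllables(word):
    """Return the list of chunks matched by the heuristic pattern."""
    return SYLLABLE_RE.findall(word.lower())

print(split_syllables("invisible"))      # ['in', 'vi', 'sib', 'le']
print(split_syllables("pronunciation"))  # ['pro', 'nun', 'cia', 'tion']
```

The second result shows the limitation: adjacent vowels are always kept together, so "ci-a" comes back as one chunk and the word is split into four pieces instead of five.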

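Today I found this Java implementation of Frank Liang's hyphenation algorithm, with patterns for English or German, which works quite well and is available on Maven Central.

Caveat: it is important to remove the last lines of the .tex pattern files, because otherwise those files cannot be loaded with the current version on Maven Central.

To load and use the hyphenator, you can use the following Java code snippet. texTable is the name of the .tex file containing the needed patterns. Those files are available on the project's GitHub site.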

 private Hyphenator createHyphenator(String texTable) {
        Hyphenator hyphenator = new Hyphenator();
        hyphenator.setErrorHandler(new ErrorHandler() {
            public void debug(String guard, String s) {
                logger.debug("{},{}", guard, s);
            }

            public void info(String s) {
                logger.info(s);
            }

            public void warning(String s) {
                logger.warn("WARNING: " + s);
            }

            public void error(String s) {
                logger.error("ERROR: " + s);
            }

            public void exception(String s, Exception e) {
                logger.error("EXCEPTION: " + s, e);
            }

            public boolean isDebugged(String guard) {
                return false;
            }
        });

        BufferedReader table = null;

        try {
            table = new BufferedReader(new InputStreamReader(Thread.currentThread().getContextClassLoader()
                    .getResourceAsStream((texTable)), Charset.forName("UTF-8")));
            hyphenator.loadTable(table);
        } catch (Utf8TexParser.TexParserException e) {
            logger.error("error loading hyphenation table: {}", e.getLocalizedMessage(), e);
            throw new RuntimeException("Failed to load hyphenation table", e);
        } finally {
            if (table != null) {
                try {
                    table.close();
                } catch (IOException e) {
                    logger.error("Closing hyphenation table failed", e);
                }
            }
        }

        return hyphenator;
    }

After that, the Hyphenator is ready to use. To detect syllables, the basic idea is to split the term at the hyphens it provides.

    String hyphenedTerm = hyphenator.hyphenate(term);

    String hyphens[] = hyphenedTerm.split("\u00AD");

    int syllables = hyphens.length;

You need to split on "\u00AD", because the API does not return a normal "-".
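U+00AD is the Unicode SOFT HYPHEN, which is normally invisible when printed, so the split character is easy to miss. A small Python sketch of the same counting step follows; the sample string is assembled by hand here (an assumption for illustration) and is not output from the Java library.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN, usually rendered as nothing

def count_parts(hyphenated):
    """Count the pieces separated by soft hyphens."""
    return len(hyphenated.split(SOFT_HYPHEN))

# Hand-built stand-in for a hyphenator result.
sample = SOFT_HYPHEN.join(["in", "vis", "i", "ble"])

print(count_parts(sample))  # 4
print(count_parts("cat"))   # 1
```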

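This approach outperforms Joe Basirico's answer, since it supports many different languages and detects German hyphenation more accurately.

Why calculate it? Every online dictionary has this information. http://dictionary.reference.com/browse/invisible in·vis·i·ble

Thank you @joe-basirico and @tihamer. I have ported @tihamer's code to Lua 5.1, 5.2 and LuaJIT 2 (it will most likely run on other versions of Lua as well):

countsyllables.lua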

function CountSyllables(word)
  local vowels = { 'a','e','i','o','u','y' }
  local numVowels = 0
  local lastWasVowel = false

  for i = 1, #word do
    local wc = string.sub(word,i,i)
    local foundVowel = false;
    for _,v in pairs(vowels) do
      if (v == string.lower(wc) and lastWasVowel) then
        foundVowel = true
        lastWasVowel = true
      elseif (v == string.lower(wc) and not lastWasVowel) then
        numVowels = numVowels + 1
        foundVowel = true
        lastWasVowel = true
      end
    end

    if not foundVowel then
      lastWasVowel = false
    end
  end

  if string.len(word) > 2 and
    string.sub(word,string.len(word) - 1) == "es" then
    numVowels = numVowels - 1
  elseif string.len(word) > 1 and
    string.sub(word,string.len(word)) == "e" then
    numVowels = numVowels - 1
  end

  return numVowels
end

Some fun tests to confirm that it works (as much as it is supposed to):

countsyllables.tests.lua

require "countsyllables"

tests = {
  { word = "what", syll = 1 },
  { word = "super", syll = 2 },
  { word = "Maryland", syll = 3},
  { word = "American", syll = 4},
  { word = "disenfranchized", syll = 5},
  { word = "Sophia", syll = 2},
  { word = "End", syll = 1},
  { word = "I", syll = 1},
  { word = "release", syll = 2},
  { word = "same", syll = 1},
}

for _,test in pairs(tests) do
  local resultSyll = CountSyllables(test.word)
  assert(resultSyll == test.syll,
    "Word: "..test.word.."\n"..
    "Expected: "..test.syll.."\n"..
    "Result: "..resultSyll)
end

print("Tests passed.")

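I could not find an adequate way to count syllables, so I designed a method myself.

You can view my method here: https://stackoverflow.com/a/32784041/2734752

I use a combination of a dictionary and an algorithmic method to count syllables.

You can view my library here: https://github.com/troywatson/Lawrence-Style-Checker

I just tested my algorithm and had a 99.4% strike rate!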

Lawrence lawrence = new Lawrence();

System.out.println(lawrence.getSyllable("hyphenation"));
System.out.println(lawrence.getSyllable("computer"));

Output:

4
3

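I ran into this exact same issue a little while ago.

I ended up using the CMU Pronouncing Dictionary for quick and accurate lookups of most words. For words not in the dictionary, I fell back to a machine learning model that is about 98% accurate at predicting syllable counts.

I wrapped the whole thing up in an easy-to-use Python module here: https://github.com/repp/big-phoney

Install: pip install big-phoney

Count syllables: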

from big_phoney import BigPhoney
phoney = BigPhoney()
phoney.count_syllables('triceratops')  # --> 4
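The general shape of a "dictionary first, fallback second" counter can be sketched in a few lines of Python. This is not how big-phoney is implemented: `KNOWN_COUNTS` is a tiny hypothetical stand-in for a pronunciation dictionary, and the fallback is a plain vowel-group count swapped in for the machine learning model.

```python
import re

# Hypothetical stand-in for a real pronunciation dictionary.
KNOWN_COUNTS = {"triceratops": 4, "invisible": 4}

def fallback_count(word):
    """Crude estimate: one syllable per run of vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def count_syllables(word):
    """Use the dictionary when possible, otherwise fall back to the estimate."""
    word = word.lower()
    if word in KNOWN_COUNTS:
        return KNOWN_COUNTS[word]
    return fallback_count(word)

print(count_syllables("Triceratops"))  # 4, from the dictionary
print(count_syllables("banana"))       # 3, from the fallback
```

The design choice is that the fallback only has to be good enough for rare words, since the dictionary handles the common ones exactly.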

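If you are not using Python and you want to try the ML-model-based approach, I did a pretty detailed write-up on Kaggle of how the syllable counting model works.

After a lot of testing, and after trying out hyphenation packages as well, I wrote my own routine based on a number of examples. I also tried the pyhyphen and pyphen packages, which interface with hyphenation dictionaries, but they produce the wrong number of syllables in many cases. The nltk package was simply too slow for this use case.

My Python implementation is part of a class I wrote, and the syllable counting routine is pasted below. It overestimates the number of syllables a bit, as I still have not found a good way to account for silent word endings.

The function returns the ratio of syllables per word, because it is used for a Flesch-Kincaid readability score. The number does not have to be exact, just close enough for an estimate.

On my 7th-generation i7 CPU, this function took 1.1-1.2 milliseconds for a 759-word sample text.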

def _countSyllablesEN(self, theText):

    cleanText = ""
    for ch in theText.lower():  # lowercase first so capital letters are not discarded
        if ch in "abcdefghijklmnopqrstuvwxyz'’":
            cleanText += ch
        else:
            cleanText += " "

    asVow    = "aeiouy'’"
    dExep    = ("ei","ie","ua","ia","eo")
    theWords = cleanText.lower().split()
    allSylls = 0
    for inWord in theWords:
        nChar  = len(inWord)
        nSyll  = 0
        wasVow = False
        wasY   = False
        if nChar == 0:
            continue
        if inWord[0] in asVow:
            nSyll += 1
            wasVow = True
            wasY   = inWord[0] == "y"
        for c in range(1,nChar):
            isVow  = False
            if inWord[c] in asVow:
                nSyll += 1
                isVow = True
            if isVow and wasVow:
                nSyll -= 1
            if isVow and wasY:
                nSyll -= 1
            if inWord[c:c+2] in dExep:
                nSyll += 1
            wasVow = isVow
            wasY   = inWord[c] == "y"
        if inWord.endswith(("e")):
            nSyll -= 1
        if inWord.endswith(("le","ea","io")):
            nSyll += 1
        if nSyll < 1:
            nSyll = 1
        # print("%-15s: %d" % (inWord,nSyll))
        allSylls += nSyll

    return allSylls/len(theWords)
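For context, this is how a syllables-per-word ratio feeds into the two standard readability formulas. The Python sketch below uses the published constants for Flesch Reading Ease and the Flesch-Kincaid Grade Level; the function and parameter names are made up for this example, and the average sentence length has to be computed separately.

```python
def flesch_reading_ease(words_per_sentence, syllables_per_word):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

def flesch_kincaid_grade(words_per_sentence, syllables_per_word):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Example inputs: 10 words per sentence, 1.5 syllables per word.
print(round(flesch_reading_ease(10, 1.5), 3))
print(round(flesch_kincaid_grade(10, 1.5), 2))
```

Because the syllable ratio is multiplied by 84.6 and 11.8 respectively, a small systematic overestimate shifts the scores noticeably but still leaves them usable as a rough estimate.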

I did this once using jsoup. Here is a sample syllable parser:

public String[] syllables(String text){
        String url = "https://www.merriam-webster.com/dictionary/" + text;
        String relHref;
        try{
            Document doc = Jsoup.connect(url).get();
            Element link = doc.getElementsByClass("word-syllables").first();
            if(link == null){return new String[]{text};}
            relHref = link.html(); 
        }catch(IOException e){
            relHref = text;
        }
        String[] syl = relHref.split("·");
        return syl;
    }

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow