Detecting syllables in a word

https://stackoverflow.com/questions/405161

03-07-2019

Question

I need to find a fairly efficient way to detect the syllables in a word. E.g.,

Invisible -> in-vi-sib-le

There are some syllabification rules that could be used:

V CV VC CVC CCV CCCV CVCC

*where V is a vowel and C is a consonant. E.g.,

Pronunciation (5 syllables: Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)
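
As a rough illustration of the notation above, a word can be mapped to its C/V pattern letter by letter. This is only a sketch: it works on spelling rather than pronunciation, it assumes that only a, e, i, o and u count as vowels, and the helper name `cv_pattern` is made up for this example.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant here


def cv_pattern(word):
    """Return the C/V pattern of a word based on its spelling only."""
    return "".join(
        "V" if ch in VOWELS else "C"
        for ch in word.lower()
        if ch.isalpha()  # skip hyphens, apostrophes and other non-letters
    )


print(cv_pattern("invisible"))  # VCCVCVCCV
```

Turning such a pattern into syllables still requires deciding where each V/CV/CVC chunk ends, which is the hard part of the problem.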

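I have tried a few methods: regular expressions (which help only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally finite state automata (which did not produce anything useful).

The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell-checking applications (using Bayesian classifiers) and for text-to-speech synthesis.

I would appreciate any tips on alternative ways to solve this problem besides my previous approaches.

I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.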

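Solution

Read about the TeX approach to this problem for the purposes of hyphenation. In particular, see Frank Liang's thesis, Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate, and it includes a small exceptions dictionary for the cases where the algorithm does not work.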
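
To give a feel for how pattern-based hyphenation in the style of Liang's thesis works, here is a minimal sketch. Digits inside a pattern rate the gap between two letters; every matching pattern is overlaid on the word, the highest value wins at each gap, and odd values mark allowed breaks. The pattern list below is a tiny hand-picked set chosen for a single word, and the function names are made up for this example; real TeX pattern files contain thousands of entries plus an exception list.

```python
import re

# Assumption: a tiny hand-picked pattern set, just enough for one word.
RAW_PATTERNS = "hy3ph he2n hena4 hen5at 1na n2at 1tio 2io o2n".split()


def parse_pattern(pattern):
    """Split 'he2n' into the letters 'hen' and gap values [0, 0, 2, 0]."""
    letters = re.sub(r"\d", "", pattern)
    values = [int(d or 0) for d in re.split(r"[.a-z]", pattern)]
    return letters, values


PATTERNS = dict(parse_pattern(p) for p in RAW_PATTERNS)


def hyphenate(word, patterns=PATTERNS):
    """Return the pieces of `word` split at the allowed break points."""
    work = "." + word.lower() + "."   # dots mark the word boundaries
    points = [0] * (len(work) + 1)    # points[k] sits just before work[k]
    for i in range(len(work)):
        for j in range(i + 1, len(work) + 1):
            values = patterns.get(work[i:j])
            if values:
                for k, v in enumerate(values):
                    points[i + k] = max(points[i + k], v)
    # Suppress breaks directly next to the first and the last letter.
    points[1] = points[2] = points[-2] = points[-3] = 0
    pieces = [""]
    for ch, p in zip(word, points[2:]):
        pieces[-1] += ch
        if p % 2:                     # an odd value means a break is allowed
            pieces.append("")
    return pieces


print("-".join(hyphenate("hyphenation")))  # hy-phen-ation
```

Note that the result is a set of hyphenation points, not a full syllabification: "ation" stays in one piece even though it contains more than one syllable.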

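Other tips

I stumbled across this page looking for the same thing, and found a few implementations of the Liang paper here: https://github.com/mnater/hyphenator

That is, unless you are the type who enjoys reading a 60-page thesis instead of adapting freely available code for a non-unique problem. :)

Here is a solution using NLTK: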

from nltk.corpus import cmudict
d = cmudict.dict()
def nsyl(word):
  return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]]
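
The one-liner above relies on a convention of the CMU dictionary: each pronunciation is a list of ARPAbet phonemes, and only vowel phonemes end in a stress digit (0, 1 or 2). Counting those gives the syllable count. The sketch below shows the same idea without NLTK; the pronunciation is typed in by hand for illustration, and the function name is made up.

```python
def count_syllables(phones):
    """Count the phonemes that end in a stress digit, i.e. the vowels."""
    return sum(1 for p in phones if p[-1].isdigit())


# Assumption: a hand-typed pronunciation in CMUdict's ARPAbet style.
hyphen = ["HH", "AY1", "F", "AH0", "N"]
print(count_syllables(hyphen))  # 2
```

With NLTK, `nsyl` returns one count per listed pronunciation, and it raises a `KeyError` for words that are not in the dictionary.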

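I am trying to tackle this problem for a program that will calculate the Flesch-Kincaid and Flesch reading scores of a block of text. My algorithm uses what I found on this web site: http://www.howmanysyllables.com/howtocountsyllables.html and it gets reasonably close. It still has trouble with complicated words like "invisible" and "hyphenation", but I have found it gets in the ballpark for my purposes.

It has the upside of being easy to implement. I found that a trailing "es" can be either syllabic or not. It is a gamble, but I decided to remove the "es" in my algorithm.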

private int CountSyllables(string word)
    {
        char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
        string currentWord = word;
        int numVowels = 0;
        bool lastWasVowel = false;
        foreach (char wc in currentWord)
        {
            bool foundVowel = false;
            foreach (char v in vowels)
            {
                //don't count diphthongs
                if (v == wc && lastWasVowel)
                {
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
                else if (v == wc && !lastWasVowel)
                {
                    numVowels++;
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
            }

            //if full cycle and no vowel found, set lastWasVowel to false;
            if (!foundVowel)
                lastWasVowel = false;
        }
        //remove es, it's _usually? silent
        if (currentWord.Length > 2 && 
            currentWord.Substring(currentWord.Length - 2) == "es")
            numVowels--;
        // remove silent e
        else if (currentWord.Length > 1 &&
            currentWord.Substring(currentWord.Length - 1) == "e")
            numVowels--;

        return numVowels;
    }

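This is a particularly difficult problem which is not completely solved by the LaTeX hyphenation algorithm. A good summary of some available methods and the challenges involved can be found in the paper Evaluating Automatic Syllabification Algorithms for English (Marchand, Adsett, and Damper 2007).

Thanks, Joe Basirico, for sharing your quick and dirty implementation in C#. I have used the big libraries, and they work, but they are usually a bit slow, and for quick projects your method works fine.

Here is your code in Java, along with test cases: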

public static int countSyllables(String word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    char[] currentWord = word.toCharArray();
    int numVowels = 0;
    boolean lastWasVowel = false;
    for (char wc : currentWord) {
        boolean foundVowel = false;
        for (char v : vowels)
        {
            //don't count diphthongs
            if ((v == wc) && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // If full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    // Remove es, it's _usually? silent.
    // Use endsWith: == on Java strings compares references, not contents.
    if (word.length() > 2 && word.endsWith("es"))
        numVowels--;
    // remove silent e
    else if (word.length() > 1 && word.endsWith("e"))
        numVowels--;
    return numVowels;
}

public static void main(String[] args) {
    String txt = "what";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "super";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Maryland";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "American";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "disenfranchized";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Sophia";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
}

The result was as expected (it works well enough for Flesch-Kincaid):

txt=what countSyllables=1
txt=super countSyllables=2
txt=Maryland countSyllables=3
txt=American countSyllables=3
txt=disenfranchized countSyllables=5
txt=Sophia countSyllables=2

Bumping @tihamer and @joe-basirico. A very useful function: not perfect, but good for most small-to-medium projects. Joe, I have rewritten your code in Python:

def countSyllables(word):
    vowels = "aeiouy"
    numVowels = 0
    lastWasVowel = False
    for wc in word:
        foundVowel = False
        for v in vowels:
            if v == wc:
                if not lastWasVowel: numVowels+=1   #don't count diphthongs
                foundVowel = lastWasVowel = True
                break
        if not foundVowel:  #If full cycle and no vowel found, set lastWasVowel to false
            lastWasVowel = False
    if len(word) > 2 and word[-2:] == "es": #Remove es - it's "usually" silent (?)
        numVowels-=1
    elif len(word) > 1 and word[-1:] == "e":    #remove silent e
        numVowels-=1
    return numVowels

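Hope someone finds this useful!

Perl has the Lingua::Phonology::Syllable module. You might try that, or look into its algorithm. I saw a few other older modules there, too.

I don't understand why a regular expression gives you only a count of syllables. You should be able to get the syllables themselves using capture parentheses. Assuming you can construct a regular expression that works, that is.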
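
To illustrate the point about getting the matched pieces rather than just a count, here is a small sketch using `re.findall`. The pattern is an illustrative spelling-based guess (an optional consonant run, a vowel run, then a trailing consonant when another consonant follows or the word ends), not a linguistic rule set, and the names are made up for this example.

```python
import re

# Assumption: an illustrative spelling-based pattern, not a complete rule set.
SYLLABLE_RE = re.compile(
    r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]*$|[^aeiouy](?=[^aeiouy]))?",
    re.IGNORECASE,
)


def split_syllables(word):
    """Return the chunks matched by the pattern, in order."""
    return SYLLABLE_RE.findall(word)


print(split_syllables("invisible"))      # ['in', 'vi', 'sib', 'le']
print(split_syllables("pronunciation"))  # ['pro', 'nun', 'cia', 'tion']
```

The second result shows the limit of this approach: "cia" is kept as one chunk, so the word comes out with four pieces instead of five.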

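Today I found this Java implementation of Frank Liang's hyphenation algorithm with patterns for English or German, which works quite well and is available on Maven Central.

Caveat: it is important to remove the last lines of the .tex pattern files, because otherwise those files cannot be loaded with the current version on Maven Central.

To load and use the hyphenator, you can use the following Java code snippet. texTable is the name of the .tex file containing the needed patterns. Those files are available on the project's GitHub site.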

 private Hyphenator createHyphenator(String texTable) {
        Hyphenator hyphenator = new Hyphenator();
        hyphenator.setErrorHandler(new ErrorHandler() {
            public void debug(String guard, String s) {
                logger.debug("{},{}", guard, s);
            }

            public void info(String s) {
                logger.info(s);
            }

            public void warning(String s) {
                logger.warn("WARNING: " + s);
            }

            public void error(String s) {
                logger.error("ERROR: " + s);
            }

            public void exception(String s, Exception e) {
                logger.error("EXCEPTION: " + s, e);
            }

            public boolean isDebugged(String guard) {
                return false;
            }
        });

        BufferedReader table = null;

        try {
            table = new BufferedReader(new InputStreamReader(Thread.currentThread().getContextClassLoader()
                    .getResourceAsStream((texTable)), Charset.forName("UTF-8")));
            hyphenator.loadTable(table);
        } catch (Utf8TexParser.TexParserException e) {
            logger.error("error loading hyphenation table: {}", e.getLocalizedMessage(), e);
            throw new RuntimeException("Failed to load hyphenation table", e);
        } finally {
            if (table != null) {
                try {
                    table.close();
                } catch (IOException e) {
                    logger.error("Closing hyphenation table failed", e);
                }
            }
        }

        return hyphenator;
    }

Afterwards the Hyphenator is ready to use. To detect syllables, the basic idea is to split the term at the provided hyphens.

    String hyphenedTerm = hyphenator.hyphenate(term);

    String hyphens[] = hyphenedTerm.split("\u00AD");

    int syllables = hyphens.length;

You need to split on "\u00AD" (the soft hyphen), since the API does not return a normal "-".
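
The same splitting step can be sketched in Python. The sample string below is built by hand to stand in for hyphenator output; it is not actual output of the Java library, and the helper name is made up.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, invisible in most renderings


def split_on_soft_hyphens(hyphenated):
    """Split a string at soft hyphens, dropping any empty pieces."""
    return [part for part in hyphenated.split(SOFT_HYPHEN) if part]


# Assumption: a hand-built sample, not real output of the library.
sample = "hy\u00adphen\u00ada\u00adtion"
parts = split_on_soft_hyphens(sample)
print(parts, len(parts))  # ['hy', 'phen', 'a', 'tion'] 4
```

Because the soft hyphen is usually not displayed, a string that contains it can look identical to the plain word when printed, which makes this easy to miss while debugging.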

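This approach outperforms Joe Basirico's answer, since it supports many different languages and detects German hyphenation more accurately.

Why calculate it? Every online dictionary has this information. http://dictionary.reference.com/browse/invisible (invisible)

Thank you, @joe-basirico and @tihamer. I have ported @tihamer's code to Lua 5.1, 5.2 and LuaJIT 2 (most likely it will run on other versions of Lua as well):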

countsyllables.lua

function CountSyllables(word)
  local vowels = { 'a','e','i','o','u','y' }
  local numVowels = 0
  local lastWasVowel = false

  for i = 1, #word do
    local wc = string.sub(word,i,i)
    local foundVowel = false;
    for _,v in pairs(vowels) do
      if (v == string.lower(wc) and lastWasVowel) then
        foundVowel = true
        lastWasVowel = true
      elseif (v == string.lower(wc) and not lastWasVowel) then
        numVowels = numVowels + 1
        foundVowel = true
        lastWasVowel = true
      end
    end

    if not foundVowel then
      lastWasVowel = false
    end
  end

  if string.len(word) > 2 and
    string.sub(word,string.len(word) - 1) == "es" then
    numVowels = numVowels - 1
  elseif string.len(word) > 1 and
    string.sub(word,string.len(word)) == "e" then
    numVowels = numVowels - 1
  end

  return numVowels
end

And some fun tests to confirm that it works (as much as it is supposed to):

countsyllables.tests.lua

require "countsyllables"

tests = {
  { word = "what", syll = 1 },
  { word = "super", syll = 2 },
  { word = "Maryland", syll = 3},
  { word = "American", syll = 4},
  { word = "disenfranchized", syll = 5},
  { word = "Sophia", syll = 2},
  { word = "End", syll = 1},
  { word = "I", syll = 1},
  { word = "release", syll = 2},
  { word = "same", syll = 1},
}

for _,test in pairs(tests) do
  local resultSyll = CountSyllables(test.word)
  assert(resultSyll == test.syll,
    "Word: "..test.word.."\n"..
    "Expected: "..test.syll.."\n"..
    "Result: "..resultSyll)
end

print("Tests passed.")

I could not find an adequate way to count syllables, so I designed a method myself.

You can view my method here: https://stackoverflow.com/a/32784041/2734752

I use a combination of a dictionary and an algorithmic method to count syllables.

You can view my library here: https://github.com/troywatson/lawrence-style-checker

I just tested my algorithm and had a 99.4% strike rate!

Lawrence lawrence = new Lawrence();

System.out.println(lawrence.getSyllable("hyphenation"));
System.out.println(lawrence.getSyllable("computer"));

Output:

4
3

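I ran into this exact same issue a little while ago.

I ended up using the CMU Pronouncing Dictionary for quick and accurate lookups of most words. For words not in the dictionary, I fell back to a machine learning model that is ~98% accurate at predicting syllable counts.

I wrapped the whole thing up in an easy-to-use Python module here: https://github.com/repp/big-phoney

Install: pip install big-phoney

Count syllables: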

from big_phoney import BigPhoney
phoney = BigPhoney()
phoney.count_syllables('triceratops')  # --> 4
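
The dictionary-first, fallback-second structure described above can be sketched as follows. This is not the big-phoney implementation: the lookup table is a tiny made-up stand-in for a real pronunciation dictionary, and a simple vowel-run heuristic replaces the machine learning model.

```python
import re

# Assumption: a tiny stand-in for a real dictionary such as CMUdict.
KNOWN_COUNTS = {"triceratops": 4, "hyphenation": 4, "computer": 3}


def heuristic_count(word):
    """Fallback: count runs of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def count_syllables(word):
    """Look the word up first; fall back to the heuristic if it is unknown."""
    key = word.lower()
    if key in KNOWN_COUNTS:
        return KNOWN_COUNTS[key]
    return heuristic_count(key)


print(count_syllables("Computer"))  # 3, from the lookup table
print(count_syllables("banana"))    # 3, from the fallback heuristic
```

The benefit of this layout is that the fallback only has to be good enough for rare words, since common words are answered exactly by the lookup.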

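If you are not using Python and want to try the ML-model-based approach, I did a pretty detailed write-up on how the syllable counting model works on Kaggle.

After doing a lot of testing and trying out hyphenation packages as well, I wrote my own based on a number of examples. I also tried the pyhyphen and pyphen packages that interface with hyphenation dictionaries, but they produce the wrong number of syllables in many cases. The nltk package was simply too slow for this use case.

My implementation in Python is part of a class I wrote, and the syllable counting routine is pasted below. It overestimates the number of syllables a bit, as I still have not found a good way to account for silent word endings.

The function returns the ratio of syllables per word, as it is used for a Flesch-Kincaid readability score. The number does not have to be exact, just close enough for an estimate.

On my 7th-generation i7 CPU, this function took 1.1-1.2 milliseconds for a 759-word sample text.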

def _countSyllablesEN(self, theText):

    cleanText = ""
    for ch in theText.lower():  # lowercase first so capital letters are not discarded
        if ch in "abcdefghijklmnopqrstuvwxyz'’":
            cleanText += ch
        else:
            cleanText += " "

    asVow    = "aeiouy'’"
    dExep    = ("ei","ie","ua","ia","eo")
    theWords = cleanText.lower().split()
    allSylls = 0
    for inWord in theWords:
        nChar  = len(inWord)
        nSyll  = 0
        wasVow = False
        wasY   = False
        if nChar == 0:
            continue
        if inWord[0] in asVow:
            nSyll += 1
            wasVow = True
            wasY   = inWord[0] == "y"
        for c in range(1,nChar):
            isVow  = False
            if inWord[c] in asVow:
                nSyll += 1
                isVow = True
            if isVow and wasVow:
                nSyll -= 1
            if isVow and wasY:
                nSyll -= 1
            if inWord[c:c+2] in dExep:
                nSyll += 1
            wasVow = isVow
            wasY   = inWord[c] == "y"
        if inWord.endswith(("e")):
            nSyll -= 1
        if inWord.endswith(("le","ea","io")):
            nSyll += 1
        if nSyll < 1:
            nSyll = 1
        # print("%-15s: %d" % (inWord,nSyll))
        allSylls += nSyll

    return allSylls/len(theWords)
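
For context, this is how a syllables-per-word ratio feeds into the two readability scores mentioned in this thread. The sketch uses the commonly published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas; the function names are made up for this example, and the words-per-sentence ratio is assumed to be computed elsewhere.

```python
def flesch_reading_ease(words_per_sentence, syllables_per_word):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(words_per_sentence, syllables_per_word):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Assumption: example ratios chosen for illustration only.
print(round(flesch_reading_ease(10, 1.5), 3))
print(round(flesch_kincaid_grade(10, 1.5), 2))
```

Since the syllable ratio is multiplied by 84.6 and 11.8 respectively, a small systematic overestimate shifts the scores noticeably, but it does so consistently across texts, which is why an approximate counter is usually acceptable here.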

I did this once using jsoup. Here is a sample syllable parser:

public String[] syllables(String text){
        String url = "https://www.merriam-webster.com/dictionary/" + text;
        String relHref;
        try{
            Document doc = Jsoup.connect(url).get();
            Element link = doc.getElementsByClass("word-syllables").first();
            if(link == null){return new String[]{text};}
            relHref = link.html(); 
        }catch(IOException e){
            relHref = text;
        }
        String[] syl = relHref.split("·");
        return syl;
    }

License: CC-BY-SA with attribution

Not affiliated with StackOverflow