Question

I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走).

At the moment I can think of one solution. I have a dictionary with Chinese words (in a database). The script will:

  1. try to find the first two characters of the sentence in the database (主楼),

  2. if 主楼 is actually a word and it's in the database, the script will try to find the first three characters (主楼怎). 主楼怎 is not a word, so it's not in the database => my application now knows that 主楼 is a separate word.

  3. try to do it with the rest of the characters.

I don't really like this approach, because even analyzing a small text would require querying the database too many times.
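Roughly, in Python terms, this is what the approach would look like (a set stands in for the database here, just for illustration):

def naive_segment(sentence, words):
    # words: a set of dictionary words, standing in for the database table
    result, i = [], 0
    while i < len(sentence):
        j = i + 1
        # keep extending the candidate while the longer string is still a word
        while j < len(sentence) and sentence[i:j + 1] in words:
            j += 1
        result.append(sentence[i:j])  # unknown single characters are kept as-is
        i = j
    return result

print(u" ".join(naive_segment(u"主楼怎么走", {u"主楼", u"怎么", u"走"})))  # -> 主楼 怎么 走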

Are there any other solutions to this?


Solution

Thanks to everyone for your help!

After a little research, and with all your suggestions in mind, I've found some working tools, which is why I'm answering my own question.

  1. A PHP class (http://www.phpclasses.org/browse/package/2431.html)

  2. A Drupal module, basically another PHP solution with 4 different segmentation algorithms (pretty easy to understand how it works) (http://drupal.org/project/csplitter)

  3. A PHP extension for Chinese word segmentation (http://code.google.com/p/phpcws/)

  4. There are some other solutions available if you try searching baidu.com for "中文分词" ("Chinese word segmentation")

Sincerely,

Equ

OTHER TIPS

You might want to consider using a trie data structure. You first construct the trie from the dictionary; searching for valid words is then much faster. The advantage is that determining whether you are at the end of a word, or need to keep looking for a longer one, is very fast.
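For illustration, here is a minimal Python sketch of that idea (not production code; a real trie would hold the full dictionary and be more memory-conscious):

# Minimal trie sketch: build it once from the dictionary, then walk it
# character by character to find every word that starts at a given position.
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#"] = True  # marks the end of a valid word
    return root

def prefix_matches(trie, text, start):
    # Yield every dictionary word that begins at text[start].
    node, i = trie, start
    while i < len(text) and text[i] in node:
        node = node[text[i]]
        i += 1
        if "#" in node:
            yield text[start:i]

trie = build_trie([u"主楼", u"怎么", u"走"])
print(list(prefix_matches(trie, u"主楼怎么走", 0)))  # -> [u'主楼']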

You have the input text, sentence, paragraph, whatever. So yes, processing it will need to query your DB for each check.

With decent indexing on the word column though, you shouldn't have too many problems.

Having said that, how big is this dictionary? After all, you only need the words, not their definitions, to check whether something is a valid word. So if at all possible (depending on the size), holding a huge memory map/hashtable/dictionary with just the keys (the actual words) may be an option, and it would be lightning fast.

At 15 million words, with an average of 7 characters at 2 bytes each, that works out to around the 200 MB mark. Not too crazy.

Edit: at 'only' 1 million words, you're looking at just over 13 MB, say 15 with some overhead. That's a no-brainer, I would say.
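As a rough sketch of that idea (assuming the words can be exported from the database once, e.g. to a hypothetical one-word-per-line file called wordlist.txt):

import codecs

# Load just the words (no definitions) into an in-memory set at startup;
# membership tests are then effectively O(1) instead of one DB query per check.
with codecs.open("wordlist.txt", "r", "utf-8") as f:
    words = set(line.strip() for line in f if line.strip())

print(u"主楼" in words)  # True if the dictionary contains it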

Another one that works well is http://www.itgrass.com/phpanalysis/index.html

It's the only one I found that works properly with UTF-8. The rest only worked for me in GB18030, which caused tons of issues later on down the line. I thought I was going to have to start over, but this one saved me a lot of time.

Well, if you have a database with all the words and there is no other way to get at those words, I think you are forced to re-query the database.

To improve the performance of this, can't you do all these checks before you insert the sentence into the database, and add spaces yourself?

(using ABCDE to represent Chinese characters for simplicity)

Let's say you've got the 'sentence' ABCDE as input, and your dictionary contains these words that start with A: AB, ABC, AC, AE, and ABB. And presume that the word CDE exists, but neither DE nor E does.

When parsing the input sentence, going left to right, the script pulls the first character A. Instead of querying the database to see if A is a word, query the database to pull all words that start with A.

Loop through those results, grabbing the next few characters from the input string to get a proper comparison:

AB  ?= AB : True
ABC ?= ABC: True
AC  ?= AB : False
AE  ?= AB : False
ABB ?= ABC: False

At this point the program forks down the two 'true' branches it found. On the first, it presumes AB is the first word, and tries to find C-starting words. CDE is found, so that branch is possible. Down the other branch, ABC is the first word, but DE is not possible, so that branch is invalid, meaning the first must be the true interpretation.

I think this method minimizes the number of calls to the database (though it might return larger result sets, since you're fetching groups of words all starting with the same character). If your database were indexed for this sort of searching, I think this would work better than going letter by letter. Looking at this whole process, and at the other answers, I think this is actually a trie structure (with the character searched for as the root of the tree), as another poster had suggested. A rough sketch of that idea is shown below.
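Here is one possible Python sketch of that prefix-fetch-and-fork idea (illustrative only; fetch_words_starting_with stands in for the single "all words starting with X" database query):

def fetch_words_starting_with(ch, dictionary):
    # stand-in for: SELECT word FROM dictionary WHERE word LIKE 'X%'
    return [w for w in dictionary if w.startswith(ch)]

def segment(text, dictionary):
    # Return one valid segmentation of text, or None if there is none.
    if not text:
        return []
    for candidate in fetch_words_starting_with(text[0], dictionary):
        if text.startswith(candidate):
            rest = segment(text[len(candidate):], dictionary)
            if rest is not None:
                return [candidate] + rest  # this branch works out; keep it
    return None  # nothing matches here, so this branch is invalid

dictionary = [u"AB", u"ABC", u"AC", u"AE", u"ABB", u"CDE"]
print(segment(u"ABCDE", dictionary))  # -> [u'AB', u'CDE']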

I do realize that the Chinese word segmentation problem is a very complex one, but in some cases this trivial algorithm may be sufficient: search for the longest word w starting at the i-th character, then start again at the (i+length(w))-th character.

Here's a Python implementation:

#!/usr/bin/env python
# encoding: utf-8

import re
import unicodedata
import codecs

class ChineseDict:

    def __init__(self, lines, rex):
        # Build the word set from the dictionary lines, skipping comment lines;
        # the regex extracts the (simplified) headword from each entry.
        self.words = set(rex.match(line).group(1) for line in lines if not line.startswith("#"))
        self.maxWordLength = max(map(len, self.words))

    def segmentation(self, text):
        # Greedy longest-match segmentation; non-letter single characters
        # (punctuation, digits, ...) are glued onto the neighbouring chunk.
        result = []
        previousIsSticky = False
        i = 0
        while i < len(text):
            # Try the longest possible substring first, then shrink until a
            # dictionary word (or a single character) is left.
            for j in range(i + self.maxWordLength, i, -1):
                s = text[i:j]
                if s in self.words:
                    break
            sticky = len(s) == 1 and unicodedata.category(s) != "Lo"
            if previousIsSticky or (result and sticky):
                result[-1] += s
            else:
                result.append(s)
            previousIsSticky = sticky
            i = j
        return u" | ".join(result)

    def genWords(self, text):
        # Yield only the dictionary words found by the same longest-match scan.
        i = 0
        while i < len(text):
            for j in range(i + self.maxWordLength, i, -1):
                s = text[i:j]
                if s in self.words:
                    yield s
                    break
            i = j


if __name__ == "__main__":
    # CC-CEDICT line format: "traditional simplified [pinyin] /definitions/";
    # the regex captures the second (simplified) field.
    cedict = ChineseDict(codecs.open("cedict_ts.u8", 'r', 'utf-8'), re.compile(r"(?u)^.+? (.+?) .+"))
    text = u"""33. 你可以叫我夏尔
    戴高乐将军和夫人在科隆贝双教堂村过周末。星期日早晨,伊冯娜无意中走进浴室,正巧将军在洗盆浴。她感到非常意外,不禁大叫一声:“我的上帝!”
    戴高乐于是转过身,看见妻子因惊魂未定而站立在门口。他继续用香皂擦身,不紧不慢地说:“伊冯娜,你知道,如果是我们之间的隐私,你可以叫我夏尔,用不着叫我上帝……”
    """
    print(cedict.segmentation(text))
    print(u" | ".join(cedict.genWords(text)))

The last part uses a copy of the CC-CEDICT dictionary to segment a (simplified) Chinese text in two flavours (respectively, with and without non-word characters):

33. 你 | 可以 | 叫 | 我 | 夏 | 尔
    戴高乐 | 将军 | 和 | 夫人 | 在 | 科隆 | 贝 | 双 | 教堂 | 村 | 过 | 周末。星期日 | 早晨,伊 | 冯 | 娜 | 无意中 | 走进 | 浴室,正巧 | 将军 | 在 | 洗 | 盆浴。她 | 感到 | 非常 | 意外,不禁 | 大 | 叫 | 一声:“我的 | 上帝!”
    戴高乐 | 于是 | 转 | 过 | 身,看见 | 妻子 | 因 | 惊魂 | 未定 | 而 | 站立 | 在 | 门口。他 | 继续 | 用 | 香皂 | 擦 | 身,不 | 紧 | 不 | 慢 | 地 | 说:“伊 | 冯 | 娜,你 | 知道,如果 | 是 | 我们 | 之间 | 的 | 隐私,你 | 可以 | 叫 | 我 | 夏 | 尔,用不着 | 叫 | 我 | 上帝……”

你 | 可以 | 叫 | 我 | 夏 | 尔 | 戴高乐 | 将军 | 和 | 夫人 | 在 | 科隆 | 贝 | 双 | 教堂 | 村 | 过 | 周末 | 星期日 | 早晨 | 伊 | 冯 | 娜 | 无意中 | 走进 | 浴室 | 正巧 | 将军 | 在 | 洗 | 盆浴 | 她 | 感到 | 非常 | 意外 | 不禁 | 大 | 叫 | 一声 | 我的 | 上帝 | 戴高乐 | 于是 | 转 | 过 | 身 | 看见 | 妻子 | 因 | 惊魂 | 未定 | 而 | 站立 | 在 | 门口 | 他 | 继续 | 用 | 香皂 | 擦 | 身 | 不 | 紧 | 不 | 慢 | 地 | 说 | 伊 | 冯 | 娜 | 你 | 知道 | 如果 | 是 | 我们 | 之间 | 的 | 隐私 | 你 | 可以 | 叫 | 我 | 夏 | 尔 | 用不着 | 叫 | 我 | 上帝 

A good and fast way to segment Chinese text is based on Maximum Matching Segmentation, which basically tests different lengths of words to see which combination of segments is most likely. It takes in a list of all possible words to do so.

Read more about it here: http://technology.chtsai.org/mmseg/

That's the method I use in my 读者 (DuZhe) Text Analyzer ( http://duzhe.aaginskiy.com ). I don't use a database; instead, I pre-load a list of words into an array, which takes up about 2 MB of RAM but executes very quickly.
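For reference, a simplified Python sketch of the maximum-matching idea (this is not the DuZhe code, and real MMSEG adds further tie-breaking rules beyond the "longest chunk" rule shown here):

# At each position, look at every "chunk" of up to three consecutive
# dictionary matches, keep the chunk whose total length is greatest,
# and commit only its first word (MMSEG rule 1, simplified).
def matches_at(text, i, words, max_len):
    found = [text[i:j] for j in range(i + 1, min(len(text), i + max_len) + 1)
             if text[i:j] in words]
    return found or [text[i]]  # fall back to the single character

def chunks(text, i, words, max_len, depth=3):
    if i >= len(text) or depth == 0:
        return [[]]
    result = []
    for w in matches_at(text, i, words, max_len):
        for rest in chunks(text, i + len(w), words, max_len, depth - 1):
            result.append([w] + rest)
    return result

def mmseg(text, words):
    max_len = max(len(w) for w in words)
    out, i = [], 0
    while i < len(text):
        best = max(chunks(text, i, words, max_len), key=lambda c: sum(map(len, c)))
        out.append(best[0])
        i += len(best[0])
    return out

print(u" ".join(mmseg(u"主楼怎么走", {u"主楼", u"怎么", u"走"})))  # -> 主楼 怎么 走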

If you are looking into lexical segmentation rather than statistical (though statistical methods can be as accurate as ~97%, according to some research), a very good segmentation tool is ADSOtrans, which can be found here: http://www.adsotrans.com

It uses a database but has a lot of redundant tables to speed up the segmentation. You can also provide grammatical definitions to assist the segmentation.

This is a fairly standard task in computational linguistics. It goes by the name "tokenization" or "word segmentation." Try searching for "chinese word segmentation" or "chinese tokenization" and you'll find several tools that have been made to do this task, as well as papers about research systems to do it.

To do this well, you typically will need to use a statistical model built by running a machine learning system on a fairly large training corpus. Several of the systems you can find on the web come with pre-trained models.

You can build a very, very long regular expression.

Edit: I meant building it automatically with a script from the DB, not writing it by hand.
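A quick Python sketch of what that could look like (the word list here is hypothetical and would be pulled from the DB once; sorting longest-first makes the alternation prefer the longest word at each position):

import re

# Hypothetical word list fetched once from the database.
words = [u"主楼", u"怎么", u"走"]

# Longest-first, so the alternation prefers the longest match at each position.
pattern = re.compile(u"|".join(map(re.escape, sorted(words, key=len, reverse=True))))

# findall simply skips characters not covered by any dictionary word.
print(u" ".join(pattern.findall(u"主楼怎么走")))  # -> 主楼 怎么 走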

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow