Python: Normalizing a text file

https://stackoverflow.com/questions/7374851

28-10-2019

Question

I have a text file that contains several spelling variants of many words:

For example:

identification ... ID .. identity...contract.... contr.... contractor...medicine...pills..tables

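So I want to have a synonym text file that lists each primary word together with its synonyms, and I want to replace every variant with the primary word. Essentially, I want to normalize the input file.

For example, my synonym list file would look like this: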

identification = ID identify
contracting = contract contractor contractors contra...... 
word3 = word3_1 word3_2 word3_3 ..... word3_n
.
.
.
.
medicine = pills tables drugs...

This is how I want the final output file to look:

identification ... identification .. identification...contractor.... contractor.... contractor...medicine...medicine..medicine

How can I go about programming this in Python?

Thanks a lot!!!

Solution

You can read the synonym file and convert it into a dictionary, table:

import re

# Maps each lowercase synonym to its lowercase primary word.
table = {}
with open('synonyms', 'r') as syn:
    for line in syn:
        # Each line has the form: primary = synonym1 synonym2 ...
        match = re.match(r'(\w+)\s+=\s+(.+)', line)
        if match:
            primary, synonyms = match.groups()
            for synonym in synonyms.split():
                table[synonym.lower()] = primary.lower()

print(table)

which yields

{'word3_1': 'word3', 'word3_3': 'word3', 'word3_2': 'word3', 'contr': 'contracting', 'contract': 'contracting', 'contractor': 'contracting', 'contra': 'contracting', 'identify': 'identification', 'contractors': 'contracting', 'word3_n': 'word3', 'id': 'identification'}
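
To try the parsing step without creating a file, the same loop can be wrapped in a function that accepts any iterable of lines. This is only a sketch: the function name build_table and the two sample lines are assumptions made for illustration, not part of the original answer.

```python
import re

def build_table(lines):
    """Map each lowercase synonym to its lowercase primary word."""
    table = {}
    for line in lines:
        # Expected line shape: "primary = synonym1 synonym2 ..."
        match = re.match(r'(\w+)\s+=\s+(.+)', line)
        if match:
            primary, synonyms = match.groups()
            for synonym in synonyms.split():
                table[synonym.lower()] = primary.lower()
    return table

# Hypothetical sample data standing in for the synonym file.
sample = [
    "identification = ID identify\n",
    "medicine = pills tables drugs\n",
]
table = build_table(sample)
print(table)
```

Because the keys are lowercased, ID is stored as 'id'. The lookup therefore has to lowercase each word from the text as well.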

Next, you can read in the text file and replace each word with its primary synonym from table:

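with open('textfile', 'r') as f:
    for line in f:
        # Split into alternating runs of word and non-word characters so that
        # spacing and punctuation survive the join; line already ends in '\n'.
        print(''.join(table.get(word.lower(), word)
                      for word in re.findall(r'(\w+|\W+)', line)), end='')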

which yields

identification     identification    identity   contracting     contracting     contracting   medicine   medicine  medicine

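re.findall(r'(\w+|\W+)', line) splits each line into runs of word and non-word characters, so whitespace and punctuation are preserved. If the whitespace does not matter, you could simply use line.split() instead.
table.get(word, word) returns table[word] if word is in table, and simply returns word if it is not.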
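
As a variation on the same idea, the replacement can also be written with re.sub and a callback, which touches only the word runs and leaves everything between them alone. This swaps in a different technique from the findall/join approach above. The table contents, the function name normalize and the helper swap are assumptions for illustration.

```python
import re

# Hypothetical table; in practice it would be built from the synonym file.
table = {
    'id': 'identification',
    'identify': 'identification',
    'pills': 'medicine',
    'tables': 'medicine',
}

def normalize(line, table):
    """Replace each word that has an entry in table; keep everything else."""
    def swap(match):
        word = match.group(0)
        # Fall back to the original word when there is no entry for it.
        return table.get(word.lower(), word)
    return re.sub(r'\w+', swap, line)

result = normalize("identification ... ID .. identity...pills..tables", table)
print(result)
# identification ... identification .. identity...medicine..medicine
```

Note that identity is left unchanged because it has no entry in table.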

Other tips

Just a thought: instead of keeping a list of every variation of a word, take a look at difflib:

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
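
A minimal sketch of how this could be applied to the normalization problem, assuming you keep only a list of primary words. The list primaries and the function name closest_primary are hypothetical. The cutoff of 0.6 is the default of get_close_matches and would likely need tuning on real data.

```python
from difflib import get_close_matches

# Hypothetical list of primary words.
primaries = ['identification', 'contractor', 'medicine']

def closest_primary(word, primaries, cutoff=0.6):
    """Return the closest primary word, or the word itself if none is close enough."""
    matches = get_close_matches(word.lower(), primaries, n=1, cutoff=cutoff)
    return matches[0] if matches else word

for word in ['contractors', 'medicin', 'pills']:
    print(word, '->', closest_primary(word, primaries))
```

This maps contractors to contractor and medicin to medicine, but leaves pills unchanged: fuzzy matching catches spelling variants, not true synonyms, so a synonym table is still needed for pairs like pills and medicine.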

License: CC-BY-SA with attribution

Not affiliated with StackOverflow