Best way to extract text from a Word document without using COM/automation?

https://stackoverflow.com/questions/42482

09-06-2019
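Question

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform, and that is non-negotiable in this case.)

Antiword seems like a reasonable option, but it also looks like it may have been abandoned.

A Python solution would be ideal, but none appears to be available.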

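Solution

I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have embedded this in Python functions, so it is easy to use from the parsing system (which is written in Python).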

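import subprocess

def doc_to_text_catdoc(filename):
    # os.popen3 was removed in Python 3; subprocess.run replaces it.
    # Passing an argument list avoids shell quoting problems with the filename.
    result = subprocess.run(['catdoc', '-w', filename],
                            capture_output=True, text=True)
    # Treat anything written to stderr as a failure.
    if not result.stderr:
        return result.stdout
    else:
        raise OSError("Executing the command caused an error: %s" % result.stderr)

# similar doc_to_text_antiword()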

By the way, the -w switch to catdoc turns off line wrapping.
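
The antiword counterpart mentioned in the comment above might look like the following. This is a minimal sketch rather than the answerer's own code: `run_extractor` is a hypothetical helper name, and the sketch assumes the `antiword` binary is on the PATH and writes the extracted text to standard output.

```python
import subprocess


def run_extractor(command):
    """Run an external extractor and return its stdout as text.

    `run_extractor` is a hypothetical helper, not part of any library.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    # This sketch checks the exit status instead of stderr, because some
    # tools print warnings to stderr even when the conversion succeeds.
    if result.returncode != 0:
        raise OSError("Executing %r failed: %s" % (command, result.stderr))
    return result.stdout


def doc_to_text_antiword(filename):
    # Assumes the antiword binary is installed and on the PATH.
    return run_extractor(['antiword', filename])
```

Checking the exit status rather than stderr is a design choice of this sketch; whether it suits a given tool depends on how that tool reports errors.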

Other tips

(Same answer as "Extracting text from MS Word files in Python")

Use the native Python docx module, which I made this week. Here is how to extract all the text from a document:

# opendocx, getdocumenttext and wordnamespaces come from the answerer's docx module
document = opendocx('Hello world.docx')

# This location is where most document content lives
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

# Extract all text
print(getdocumenttext(document))

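See the Python DocX site.

100% Python, no COM, no .net, no Java, no parsing serialized XML with regular expressions, no junk.

If all you want is to extract text from Word files (.docx), it is possible to do so with Python alone. As Guy Starbuck wrote, you just need to unzip the file and parse the XML. Inspired by python-docx, I wrote a simple function to do this: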

"""
Module that extracts text from an MS XML Word document (.docx).
(Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
"""

# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the
# C accelerator automatically when it is available.
from xml.etree.ElementTree import XML
import zipfile

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    document = zipfile.ZipFile(path)
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)

    paragraphs = []
    # Element.getiterator() was removed in Python 3.9; iter() replaces it.
    for paragraph in tree.iter(PARA):
        texts = [node.text
                 for node in paragraph.iter(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)
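
To see the unzip-and-parse approach working without a real Word file, the sketch below builds a tiny archive in memory and extracts its paragraphs. The names `SAMPLE_XML`, `make_sample_docx` and `extract_paragraphs` are hypothetical, and the archive holds only `word/document.xml`, so it is enough for this demonstration but is not a file Word itself would open.

```python
import io
import zipfile
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

# Hand-written minimal document.xml: two paragraphs, the first split
# across two runs, as often happens in real documents.
SAMPLE_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    '<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def make_sample_docx():
    """Build an in-memory zip containing only word/document.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', SAMPLE_XML)
    buffer.seek(0)
    return buffer


def extract_paragraphs(docx_file):
    """Return the non-empty paragraphs of a .docx as a list of strings."""
    with zipfile.ZipFile(docx_file) as archive:
        tree = XML(archive.read('word/document.xml'))
    paragraphs = []
    for paragraph in tree.iter(W + 'p'):
        # Join the text of every run so split words are reassembled.
        text = ''.join(n.text for n in paragraph.iter(W + 't') if n.text)
        if text:
            paragraphs.append(text)
    return paragraphs


print(extract_paragraphs(make_sample_docx()))
# prints ['Hello, world', 'Second paragraph']
```

Joining the runs inside each paragraph matters because Word frequently splits a single sentence into several `w:t` elements.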

Using the OpenOffice API, Python, and Andrew Pitonyak's excellent online macro book, I managed to do this. Section 7.16.4 is the place to start.

One other tip for making it work without needing a screen at all is to use the Hidden property:

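from com.sun.star.beans import PropertyValue

# desktop (the UNO desktop object) and docpath (a file URL) are assumed to be set up earlier
RO = PropertyValue('ReadOnly', 0, True, 0)
Hidden = PropertyValue('Hidden', 0, True, 0)
xDoc = desktop.loadComponentFromURL(docpath, "_blank", 0, (RO, Hidden,))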

Otherwise the document flicks up on the screen (probably on the web server console) when you open it.

OpenOffice has an API.

For docx files, check out the Python script docx2txt, available at

http://cobweb.ecn.purdue.edu/~kak/distMisc/docx2txt

It extracts the plain text from a docx document.

tika-python

A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.

Note: It also works nicely with PyInstaller.

Install with pip:

pip install tika

Sample:

#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"]) #To get the meta data of the file
print(parsed["content"]) # To get the content of the file

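Link to the official GitHub

This worked well for .doc and .odt.

It calls openoffice on the command line to convert your file to text, which you can then simply load into Python.

(It seems to have other format options, though they do not appear to be documented.)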
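
The answer does not show the command it used, so the following is only a sketch of one way to drive the office suite from Python. It assumes a `soffice` binary on the PATH that accepts `--headless` and `--convert-to` (as LibreOffice does; older OpenOffice builds may differ). The helper names `build_convert_command` and `office_to_text` are hypothetical, and the encoding of the generated text file is assumed to be UTF-8.

```python
import os
import subprocess
import tempfile


def build_convert_command(source_path, output_dir):
    """Build a headless conversion command (assumes soffice is on the PATH)."""
    return ['soffice', '--headless', '--convert-to', 'txt:Text',
            '--outdir', output_dir, source_path]


def office_to_text(source_path):
    """Convert a document to plain text and return its contents."""
    with tempfile.TemporaryDirectory() as output_dir:
        # check=True raises CalledProcessError if the converter fails.
        subprocess.run(build_convert_command(source_path, output_dir),
                       check=True)
        base = os.path.splitext(os.path.basename(source_path))[0]
        text_path = os.path.join(output_dir, base + '.txt')
        # The output encoding is an assumption; undecodable bytes are replaced.
        with open(text_path, encoding='utf-8', errors='replace') as handle:
            return handle.read()
```

Writing to a temporary directory keeps the converted file away from the source document and cleans it up once the text has been read.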

License: CC-BY-SA with attribution

Not affiliated with StackOverflow