Extracting text from MS Word files in Python

https://stackoverflow.com/questions/125222

02-07-2019

Question

For working with MS Word files in Python, there are the Python win32 extensions, which can be used on Windows. How do I do the same on Linux? Is there a library?

Solution

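You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping the text out of a Word .doc file. It works pretty well for simple documents (obviously it loses formatting). It is available through apt, and probably as an RPM, or you could compile it yourself.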
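
A minimal sketch of that subprocess call, assuming antiword is installed and on the PATH. The helper names antiword_command and doc_to_text are made up for illustration; they are not part of antiword or of any library.

```python
import subprocess


def antiword_command(doc_path):
    """Build the argument list for antiword (hypothetical helper)."""
    # Passing a list avoids shell quoting problems with spaces in file names.
    return ["antiword", doc_path]


def doc_to_text(doc_path):
    """Run antiword and return whatever it prints to stdout as a string."""
    result = subprocess.run(
        antiword_command(doc_path),
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode the output bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling doc_to_text("report.doc") (the file name is a placeholder) should then return the plain text, provided antiword can read the file.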

Other tips

Use the native Python docx module. Here's how to extract all the text from a document:

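import docx

# filename is the path to your .docx file
document = docx.Document(filename)
# paragraph.text is already a str in Python 3, so no encode() is needed
doc_text = '\n\n'.join(
    paragraph.text for paragraph in document.paragraphs
)
print(doc_text)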

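See the Python DocX site.

Also check out Textract, which pulls out tables and so on.

Parsing XML with regexes invokes Cthulhu. Don't do it!

benjamin's answer is a pretty good one. I have just consolidated it...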

import zipfile, re

docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml').decode('utf-8')
cleaned = re.sub('<(.|\n)*?>','',content)
print(cleaned)
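
For anyone who takes the warning about regexes seriously, here is a sketch of the same idea using an XML parser from the standard library instead. The function name docx_paragraphs is made up for illustration, and the sample archive built at the end is a hand-made stand-in, not a complete Word document. Nested structures such as text boxes may yield repeated text, so treat this as a starting point only.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_file):
    """Return the text of each <w:p> paragraph (accepts a path or file object)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join every <w:t> text run found inside this paragraph
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs


# Tiny in-memory archive so the sketch can be tried without a real file
sample_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second</w:t></w:r></w:p></w:body></w:document>"
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)
sample.seek(0)
print(docx_paragraphs(sample))  # ['Hello world', 'Second']
```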

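OpenOffice.org can be scripted with Python: see here.

Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.

I know this is an old question, but I was recently trying to find a way to extract text from MS Word files, and the best solution I have found so far is wvLib: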

http://wvware.sourceforge.net/

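After installing the library, using it in Python is pretty easy:

import subprocess

# word_file and output_txt_file are paths you supply.
# The Python 2 "commands" module no longer exists in Python 3;
# subprocess replaces it.
subprocess.run(['wvText', word_file, output_txt_file], check=True)
with open(output_txt_file) as f:
    out = f.read()

And that's it. What we're doing is using subprocess.run to run wvText, which extracts the text from a Word document into a text file, and then reading that file back. After this, the entire text of the Word document is in the out variable, ready to use.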

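Hopefully this will help anyone who runs into similar issues in the future.

Take a look at how the doc format works. AbiWord is my recommended tool. There are limitations though:

However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.

(Note: I also posted this on another question, but it seems relevant here, so please excuse the repost.)

Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously, to use this in a Qt program you'd have to spawn a process for it and so on, but the command line I've hacked together is: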

unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

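That is:

unzip -p file.docx: -p == "unzip to stdout"

grep '<w:t': grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)

sed 's/<[^<]*>//g': remove everything inside tags

grep -v '^[[:space:]]*$': remove blank lines

There is likely a more efficient way to do this, but it seems to work for me on the few documents I've tested it with.

As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform, despite being a bit of an ugly hack ;)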

If your intention is to use pure Python modules without calling a subprocess, you can use the zipfile Python module.

import zipfile

content = ""
# Load the DocX into a zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# List the members of the archive
unpacked = docx.infolist()
# Find word/document.xml in the package and read it into a string
for item in unpacked:
    if item.orig_filename == 'word/document.xml':
        # read() returns bytes in Python 3, so decode to str
        content = docx.read(item.orig_filename).decode('utf-8')

Your content string needs to be cleaned up, however. One way of doing this is:

# Clean the content string from xml tags for better search
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
    if '>' in item:
        bad_good = item.split('>')
        if bad_good[-1] != '':
            fullyclean.append(bad_good[-1])
        else:
            pass
    else:
        pass

# Assemble a new string with all pure content
content = " ".join(fullyclean)

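But surely there is a more elegant way to clean up the string, probably using the re module. Hope this helps.

Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv

If you have LibreOffice installed, you can simply call it from the command line to convert the file to text, then load the text into Python.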
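
A rough sketch of that approach, assuming the LibreOffice launcher is available as soffice (on some systems it is called libreoffice instead). The helper names are made up for illustration, and the encoding of the generated text file is an assumption that may depend on your setup.

```python
import subprocess
from pathlib import Path


def soffice_command(doc_path, out_dir):
    """Build the headless conversion command (hypothetical helper)."""
    return [
        "soffice", "--headless",
        "--convert-to", "txt:Text",   # plain-text export filter
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_to_text(doc_path, out_dir="."):
    """Convert a Word file with LibreOffice and return the resulting text."""
    subprocess.run(soffice_command(doc_path, out_dir), check=True)
    # LibreOffice names the output after the input file, with a .txt extension
    txt_path = Path(out_dir) / (Path(doc_path).stem + ".txt")
    # Assumption: the output is UTF-8; undecodable bytes are replaced
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Starting LibreOffice is slow, so for many files it may be worth passing several documents to one invocation rather than calling it once per file.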

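I'm not sure you're going to have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!

@Swati, that's in HTML, which is fine and dandy, but most Word documents aren't so nice!

To read Word 2007 and later files, including .docx files, you can use the python-docx package: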

from docx import Document
document = Document('existing-document-file.docx')
document.save('new-file-name.docx')

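To read .doc files from Word 2003 and earlier, make a subprocess call to antiword. You need to install antiword first:

sudo apt-get install antiword

Then just call it from your Python script:

import subprocess
input_word_file = "input_file.doc"
output_text_file = "output_file.txt"
# Pass the arguments as a list so file names with spaces need no shell quoting
with open(output_text_file, "w") as out:
    subprocess.run(["antiword", input_word_file], stdout=out, check=True)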

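Is this an old question? I believe there is no such thing. There are only answered and unanswered ones. This one is pretty much unanswered, or half answered if you wish. Methods for reading *.docx (MS Word 2007 and later) documents without COM interop are all covered. What is lacking is a method for extracting text from *.doc (MS Word 97-2000) files using Python only. Is this complicated? To do: not really. To understand: well, that's another thing.

When I didn't find any finished code, I read some format specifications and dug out some proposed algorithms in other languages.

An MS Word (*.doc) file is an OLE2 compound file. Without bothering you with a lot of unnecessary detail, think of it as a file system stored in a file. It actually uses a FAT structure, so the definition holds. (Hm, maybe you can loop-mount it on Linux???) This way you can store more files within a file, such as pictures. The same is done in *.docx, using a ZIP archive instead. There are packages on PyPI that can read OLE files (olefile, compoundfiles, ...). I used the compoundfiles package to open the *.doc file. However, in MS Word 97-2000 the internal subfiles are not XML or HTML but binary files. And as if that were not enough, each one contains information about the other, so you have to read at least two of them and unravel the stored information accordingly. To understand it fully, read the PDF document from which I took the algorithm.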
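
To make the OLE2-versus-ZIP distinction concrete, here is a small sketch that classifies a file by its first bytes, using only the standard library. The function names are made up for illustration. Note the limits: an OLE2 container may just as well be an old Excel or PowerPoint file, and a ZIP archive is not necessarily a .docx, so this only tells you which kind of container you are dealing with.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file signature
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header


def sniff_container(header):
    """Classify the leading bytes of a file as 'ole2', 'zip' or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path):
    """Read the first 8 bytes of a file and classify its container type."""
    with open(path, "rb") as f:
        return sniff_container(f.read(8))
```

A result of "ole2" suggests the binary *.doc route, while "zip" suggests the *.docx route.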

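The code below was composed very hastily and tested on a small number of files. As far as I can see, it works as intended. Sometimes some gibberish appears at the start, and almost always at the end of the text. There can be some odd characters in between as well.

Those of you who just wish to search for text will be happy. Still, I urge anyone who can help improve this code to do so.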


doc2text module:
"""
This is Python implementation of C# algorithm proposed in:
http://b2xtranslator.sourceforge.net/howtos/How_to_retrieve_text_from_a_binary_doc_file.pdf

Python implementation author is Dalen Bernaca.
Code needs refining and probably bug fixing!
As I am not a C# expert I would like some code rechecks by one.
Parts of which I am uncertain are:
    * Did the author of the original algorithm use uint32 and int32 correctly when unpacking?
      I copied each occurrence as in the original algorithm.
    * Is the FIB length for MS Word 97 1472 bytes as in MS Word 2000, and would it make any difference if it is not?
    * Did I interpret each C# command correctly?
      I think I did!
"""

from compoundfiles import CompoundFileReader, CompoundFileError
from struct import unpack

__all__ = ["doc2text"]

def doc2text(path):
    text = ""
    cr = CompoundFileReader(path)
    # Load WordDocument stream:
    try:
        f = cr.open("WordDocument")
        doc = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupted or it is not a Word document at all.")
    # Extract file information block and piece table stream information from it.
    # "<L" and "<l" force little-endian 4-byte integers; a bare "L" uses the
    # native size, which is 8 bytes on 64-bit Linux and makes unpack() fail.
    fib = doc[:1472]
    fcClx = unpack("<L", fib[0x01a2:0x01a6])[0]
    lcbClx = unpack("<L", fib[0x01a6:0x01a6 + 4])[0]
    tableFlag = (unpack("<L", fib[0x000a:0x000a + 4])[0] & 0x0200) == 0x0200
    tableName = ("0Table", "1Table")[tableFlag]
    # Load piece table stream:
    try:
        f = cr.open(tableName)
        table = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupt. '%s' piece table stream is missing." % tableName)
    cr.close()
    # Find piece table inside a table stream:
    clx = table[fcClx:fcClx + lcbClx]
    pos = 0
    pieceTable = b""
    lcbPieceTable = 0
    while pos < len(clx):
        # Indexing bytes yields an int in Python 3, so compare with numbers
        if clx[pos] == 0x02:
            # This is the piece table, we store it:
            lcbPieceTable = unpack("<l", clx[pos + 1:pos + 5])[0]
            pieceTable = clx[pos + 5:pos + 5 + lcbPieceTable]
            break
        elif clx[pos] == 0x01:
            # This is the beginning of some other substructure, we skip it:
            pos = pos + 1 + 1 + clx[pos + 1]
        else:
            break
    if not pieceTable:
        raise CompoundFileError("The file is corrupt. Cannot locate a piece table.")
    # Read info about each piece from pieceTable and extract it from the WordDocument stream:
    pieceCount = (lcbPieceTable - 4) // 12
    for x in range(pieceCount):
        cpStart = unpack("<l", pieceTable[x * 4:x * 4 + 4])[0]
        cpEnd = unpack("<l", pieceTable[(x + 1) * 4:(x + 1) * 4 + 4])[0]
        offsetDescriptor = ((pieceCount + 1) * 4) + (x * 8)
        pieceDescriptor = pieceTable[offsetDescriptor:offsetDescriptor + 8]
        fcValue = unpack("<L", pieceDescriptor[2:6])[0]
        isANSI = (fcValue & 0x40000000) == 0x40000000
        fc = fcValue & 0xbfffffff
        cb = cpEnd - cpStart
        if isANSI:
            # 8-bit piece: as I read the MS-DOC specification (FcCompressed),
            # the text starts at fc / 2. Worth rechecking against your files.
            fc //= 2
            enc = "cp1252"
        else:
            # 16-bit piece: two bytes per character, little-endian
            enc = "utf-16-le"
            cb *= 2
        text += doc[fc:fc + cb].decode(enc, "ignore")
    return "\n".join(text.splitlines())
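
One portability detail about the unpack() calls is easy to check in isolation. Format codes without a byte-order prefix use the platform's native sizes, so "L" is not guaranteed to be 4 bytes, whereas "<L" always is. The helper read_u32 below is a made-up name for illustration and is not part of the module above.

```python
import struct

# Native "L" follows the platform's C unsigned long; "<L" is always
# 4 bytes, little-endian, which is what the .doc structures use.
print(struct.calcsize("L"), struct.calcsize("<L"))


def read_u32(data, offset):
    """Read a little-endian unsigned 32-bit integer at the given byte offset."""
    return struct.unpack_from("<L", data, offset)[0]


sample = bytes([0x00, 0x02, 0x00, 0x00, 0xFF])
print(hex(read_u32(sample, 0)))  # 0x200
```

On a 64-bit Linux build the first print typically shows 8 for the native code, which is why unpacking a 4-byte slice with a bare "L" raises struct.error there.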

Just another option for reading 'doc' files without using COM: miette. It should work on any platform.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow