I'm looking for a simple way to extract text from Excel/Word/PowerPoint files. The objective is to index their contents in Whoosh for search with Haystack.

There are some packages like xlrd and pandas that work for Excel, but they go way beyond what I need, and I'm not sure they will simply return each cell's unformatted text content exactly as it appears in the cell.

Does anybody know of an easy way around this? My guess is that MS Office files must be XML-shaped.

Thanks!

A.

Solution

I've done this "by hand" before. As it turns out, .(doc|ppt|xls)x files are just zip files that contain .xml files with all of your content. So if you can't find a better tool, you can use zipfile and your favorite XML parser to read the contents.
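
A minimal sketch of that approach, using only the standard library. The function name extract_text and the TEXT_PARTS table are made up for illustration, and the part paths are assumptions about where each format usually keeps its text: word/document.xml for .docx, ppt/slides/slideN.xml for .pptx, and xl/sharedStrings.xml for .xlsx. It collects every element whose local tag name is "t", which is where these formats store text runs.

```python
import os
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical lookup table: assumed locations of the text-bearing parts
# inside each package type. Real files may keep text in other parts too
# (headers, footers, notes, comments), which this sketch ignores.
TEXT_PARTS = {
    ".docx": ("word/document.xml",),
    ".pptx": ("ppt/slides/slide",),
    ".xlsx": ("xl/sharedStrings.xml",),
}


def extract_text(path):
    """Return the text runs found in a .docx/.pptx/.xlsx file, one per line."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in TEXT_PARTS:
        raise ValueError("unsupported file type: " + ext)
    prefixes = TEXT_PARTS[ext]

    chunks = []
    with zipfile.ZipFile(path) as archive:
        # Names are sorted lexically, so slide10.xml comes before slide2.xml.
        for name in sorted(archive.namelist()):
            if not (name.startswith(prefixes) and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            for elem in root.iter():
                # Tags look like "{namespace-uri}t"; compare the local name only.
                if elem.tag.rsplit("}", 1)[-1] == "t" and elem.text:
                    chunks.append(elem.text)
    return "\n".join(chunks)
```

Calling extract_text on a document path returns a plain string that can be handed to the indexer. Note the limits: for .xlsx this only sees string cells (numbers live in the worksheet parts, not in the shared strings), and none of this applies to the older binary .doc/.xls/.ppt formats, which are not zip archives.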

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow