Question

I'm looking for a simple way to extract text from Excel/Word/PowerPoint files. The objective is to index the contents in whoosh for search with haystack.

There are some packages like xlrd and pandas that work for Excel, but they go way beyond what I need, and I'm not sure they will just give me each cell's unformatted text content as it appears in the cell.

Does anybody know of an easy way to do this? My guess is that MS Office files must be XML-based.

Thanks!

A.


Solution

I've done this "by hand" before. As it turns out, .(doc|ppt|xls)x files are just zip archives containing .xml files that hold all of your content (the older binary .doc/.ppt/.xls formats are not). So if you can't find a better tool, you can use zipfile and your favorite XML parser to read the contents.
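For what it's worth, here is a minimal sketch of that approach using only the standard library (zipfile plus xml.etree.ElementTree). The function name extract_text and the TEXT_PARTS table are my own, not part of any library. It assumes the usual layout of these archives: body text in word/document.xml for .docx, one ppt/slides/slideN.xml part per slide for .pptx, and string cells in xl/sharedStrings.xml for .xlsx, with the text itself sitting in elements whose local tag name is "t". Numeric cells, headers, footers, and notes live in other parts and are not covered here.

```python
import os
import zipfile
import xml.etree.ElementTree as ET

# Archive members assumed to hold the visible text for each format.
# Each value is a tuple of name prefixes to match against the zip listing.
TEXT_PARTS = {
    ".docx": ("word/document.xml",),
    ".pptx": ("ppt/slides/slide",),
    ".xlsx": ("xl/sharedStrings.xml",),
}


def extract_text(path):
    """Return a list of text runs found in a .docx, .pptx or .xlsx file."""
    suffix = os.path.splitext(str(path))[1].lower()
    prefixes = TEXT_PARTS.get(suffix)
    if prefixes is None:
        raise ValueError(f"unsupported file type: {path}")

    chunks = []
    with zipfile.ZipFile(path) as archive:
        # Sorted by name, so slide10.xml comes before slide2.xml;
        # good enough for indexing, not for exact reading order.
        for name in sorted(archive.namelist()):
            if not name.endswith(".xml") or not name.startswith(prefixes):
                continue
            root = ET.fromstring(archive.read(name))
            for element in root.iter():
                # Tags look like "{namespace}t"; keep only the local name.
                local_name = element.tag.rsplit("}", 1)[-1]
                if local_name == "t" and element.text:
                    chunks.append(element.text)
    return chunks
```

Something like `" ".join(extract_text("report.docx"))` (the file name is just a placeholder) should then give you one plain string to hand to the indexer. Matching on the local tag name keeps the sketch short, at the cost of not distinguishing between the different Office namespaces.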

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow