Question

I'm looking for a simple way to extract text from Excel/Word/PowerPoint files. The objective is to index the contents in Whoosh for search with Haystack.

There are some packages like xlrd and pandas that work for excel, but they go way beyond what I need, and I'm not really sure that they will actually just print the cell's unformatted text content straight from the box.

Does anybody know of an easy way around this? My guess is that MS Office files must be XML-shaped.

Thanks!

A.


Solution

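I've done this "by hand" before. As it turns out, .docx, .pptx, and .xlsx files are just zip archives that contain .xml files with all of your content. So if you can't find a better tool for the job, you can use zipfile and your favorite XML parser to read the contents. (This applies to the newer "x" formats only; the older .doc, .ppt, and .xls files are binary and can't be read this way.)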
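Here is a rough sketch of that approach in Python, using only the standard library (zipfile plus xml.etree.ElementTree as the parser). The function name extract_text is made up for illustration, and the code rests on an assumption worth checking against your own files: that the visible text sits in elements whose local tag name is "t" (w:t in Word, a:t in PowerPoint, and t in Excel's shared strings part). It scans every .xml part in the archive rather than picking specific ones, so treat it as a starting point and not a complete extractor.

```python
import zipfile
import xml.etree.ElementTree as ET


def extract_text(source):
    """Collect text from the XML parts of a .docx/.pptx/.xlsx file.

    `source` is a file path or a binary file-like object.
    Assumption: text lives in elements whose local name is "t".
    """
    chunks = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue
            root = ET.fromstring(archive.read(name))
            for element in root.iter():
                # Drop the "{namespace}" prefix to get the local tag name.
                local_name = element.tag.rsplit("}", 1)[-1]
                if local_name == "t" and element.text:
                    chunks.append(element.text)
    return " ".join(chunks)
```

Calling extract_text("report.docx") (a hypothetical file name) returns one space-separated string that you could hand to the indexer. One caveat for Excel: numeric cell values are stored in the worksheet parts rather than in the shared strings, so this sketch only picks up the string cells.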

License: CC-BY-SA with attribution
Not affiliated with StackOverflow