How to load multiple XML files of corpora with NLTK and use it as a whole with Text class?

https://stackoverflow.com/questions/10179430

01-06-2021

Question

Folks, I've put together a set of corpora for NLTK which are basically simple XML files. I can load them just fine like this:

>>> from nltk.corpus import cicero
>>> print cicero.fileids()
['cicero_academica.xml', 'cicero_arati_phaenomena.xml', ...]

Now, I understand XMLCorpusReader won't give me the content of all those XML files at once because it expects a single XML file to process at a time, right? I tried to bypass that by writing a for loop, collecting everything in a list and passing it to XMLCorpusReader, but no luck...

Simply put: how could I load multiple XML corpora with NLTK and run .words() in all of them at once? Working code examples would be good.

It seems that I can't load all the XML files at once and use them with, say, the Text() class to print concordances of a word across ALL the XML files rather than one at a time.

Is there any workaround or real NLTK solution for this? Should I write a magical subclass of XMLCorpusReader that does it? Should I drop XML and go for flat files...?

This is similar to my problem, but so far I think the answers there are not really useful NLTK-wise: Can NLTK's XMLCorpusReader be used on a multi-file corpus?

Solution

Not exactly what I was after but it solved the problem for now. I'll play around with it a bit more, so perhaps this will turn out different later on. Anyway, a small working test :-)

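# CategorizedXMLCorpusReader is taken from this answer (imported here from a local CatXMLReader module):
# http://stackoverflow.com/questions/6849600/does-anyone-have-a-categorized-xml-corpus-reader-for-nltk
from CatXMLReader import CategorizedXMLCorpusReader

from nltk.corpus import cicero
from nltk import Text

# Absolute paths of every XML file in the corpus
fileids = cicero.abspaths()
# Root is '/' because the file ids are already absolute paths
reader = CategorizedXMLCorpusReader('/', fileids, cat_file='cats.txt')
# Build a single Text over the words of all files
words = Text(reader.words(fileids))
# concordance() prints its results and returns None, so no print is needed
words.concordance('et')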
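
For comparison, here is a minimal sketch of the "for loop" idea from the question that needs no extra reader class: ask the reader for the words of one file at a time and chain the results into a single token list. It only assumes the reader exposes `fileids()` and a `words(fileid)` that accepts one file identifier, which is how the question describes XMLCorpusReader. The names `all_words` and `StubReader` are hypothetical, and the stub only stands in for the real corpus so the snippet runs on its own.

```python
from itertools import chain


def all_words(reader):
    """Return one flat token list covering every file the reader knows about.

    Assumes reader.words(fileid) takes exactly one file identifier.
    """
    per_file = (reader.words(fileid) for fileid in reader.fileids())
    return list(chain.from_iterable(per_file))


class StubReader:
    """Hypothetical stand-in for a corpus reader, for illustration only."""

    def __init__(self, docs):
        self._docs = docs  # maps fileid -> list of tokens

    def fileids(self):
        return sorted(self._docs)

    def words(self, fileid):
        return self._docs[fileid]


stub = StubReader({"a.xml": ["arma", "virumque"], "b.xml": ["et", "cetera"]})
tokens = all_words(stub)  # tokens of a.xml followed by tokens of b.xml
```

With the corpus from the question, the intended call would be something like `Text(all_words(cicero)).concordance('et')`. This has not been tried against the actual cicero corpus, and it builds the whole token list in memory, which may matter for large collections.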

Licensed under: CC-BY-SA with attribution
