Question

I'm a beginner at this, but I'd like to create a folder containing many texts (let's say novels saved as .txt files). I'd then like to ask the user to select one of these novels and have the part-of-speech tagger automatically analyse the entire text. Is this possible? I've been trying with:

import nltk

text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

How do I make it analyse the text the user has selected instead of this sentence? And how do I import these texts?

Solution

There are a couple of ways to read a directory of text files.

Let's try the native Python way first, from the terminal/console/command prompt:

~$ mkdir ~/testcorpora
~$ cd ~/testcorpora/
~/testcorpora$ ls
~/testcorpora$ echo 'this is a foo foo bar bar.\n bar foo, dah dah.' > somefoobar.txt
~/testcorpora$ echo 'what are you talking about?' > talkingabout.txt
~/testcorpora$ ls
somefoobar.txt  talkingabout.txt
~/testcorpora$ cd ..
~$ python
>>> import os
>>> from nltk.tokenize import word_tokenize
>>> from nltk.tag import pos_tag
>>> corpus_directory = 'testcorpora/'
>>> for infile in os.listdir(corpus_directory):
...     with open(corpus_directory+infile, 'r') as fin:
...             pos_tag(word_tokenize(fin.read()))
... 
[('what', 'WP'), ('are', 'VBP'), ('you', 'PRP'), ('talking', 'VBG'), ('about', 'IN'), ('?', '.')]
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('foo', 'NN'), ('bar', 'NN'), ('bar.\\n', 'NN'), ('bar', 'NN'), ('foo', 'NN'), (',', ','), ('dah', 'NN'), ('dah', 'NN'), ('.', '.')]
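
To let the user pick one novel instead of tagging every file, you can list the files in the folder and read only the chosen one. Here's a minimal sketch along those lines (it assumes Python 3 and the same testcorpora/ folder as above; novels, choice and tagged are just placeholder names):

import os
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

corpus_directory = 'testcorpora/'

# List the .txt files in the folder and number them for the user.
novels = sorted(f for f in os.listdir(corpus_directory) if f.endswith('.txt'))
for i, name in enumerate(novels):
    print(i, name)

# Ask the user which novel to tag (use raw_input on Python 2).
choice = int(input('Number of the novel to tag: '))

# Read the chosen file and POS-tag the whole text.
with open(os.path.join(corpus_directory, novels[choice]), 'r') as fin:
    tagged = pos_tag(word_tokenize(fin.read()))

print(tagged[:20])  # show the first 20 (word, tag) pairs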

The other solution is to use PlaintextCorpusReader from NLTK, then run word_tokenize and pos_tag on the corpus (see Creating a new corpus with NLTK):

>>> from nltk.corpus.reader.plaintext import PlaintextCorpusReader
>>> from nltk.tag import pos_tag
>>> corpusdir = 'testcorpora/'
>>> newcorpus = PlaintextCorpusReader(corpusdir,'.*')
>>> dir(newcorpus)
['CorpusView', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_encoding', '_fileids', '_get_root', '_para_block_reader', '_read_para_block', '_read_sent_block', '_read_word_block', '_root', '_sent_tokenizer', '_tag_mapping_function', '_word_tokenizer', 'abspath', 'abspaths', 'encoding', 'fileids', 'open', 'paras', 'raw', 'readme', 'root', 'sents', 'words']
>>> # POS tagging all the words in all the text files at the same time.
>>> newcorpus.words()
['this', 'is', 'a', 'foo', 'foo', 'bar', 'bar', '.\\', ...]
>>> pos_tag(newcorpus.words())
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('foo', 'NN'), ('bar', 'NN'), ('bar', 'NN'), ('.\\', ':'), ('n', 'NN'), ('bar', 'NN'), ('foo', 'NN'), (',', ','), ('dah', 'NN'), ('dah', 'NN'), ('.', '.'), ('what', 'WP'), ('are', 'VBP'), ('you', 'PRP'), ('talking', 'VBG'), ('about', 'IN'), ('?', '.')]
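
PlaintextCorpusReader can also restrict words() to a single fileid, so you can tag only the file the user picked. A quick sketch (the fileids() listing and the tagged output below are what the two sample files above should give with the reader's default tokenizer):

>>> newcorpus.fileids()
['somefoobar.txt', 'talkingabout.txt']
>>> # POS tag just one file, chosen by its fileid.
>>> pos_tag(newcorpus.words('talkingabout.txt'))
[('what', 'WP'), ('are', 'VBP'), ('you', 'PRP'), ('talking', 'VBG'), ('about', 'IN'), ('?', '.')]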