Identify all Substantives in a Text

https://stackoverflow.com/questions/19243162

30-06-2022

Question

I am currently training for the ISTQB Test Manager certification. For this purpose I would like to use Anki and its cloze deletions.

I would like to generate the flashcards automatically with a Python script. The script should replace every substantive with a cloze deletion.

My question is:

How do I identify substantives in a text with a Python script?

Unfortunately the syllabus is not available in German. German has the big advantage that substantives are capitalized.

Solution

Look into parsing or POS tagging (POS = part of speech, e.g. verb or noun).

Both pattern and NLTK provide tools for this.

An example from pattern:

>>> from pattern.en import parse
>>> print(parse('I eat pizza with a fork.'))

I/PRP/B-NP/O eat/VBD/B-VP pizza/NN/B-NP/O with/IN/B-PP/B-PNP a/DT/B-NP/I-PNP
fork/NN/I-NP/I-PNP ././O/O
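
To pull the nouns out of output like this, one option is to split each token on the slash and look at the second field. This is only a sketch: it assumes the word/tag/... layout shown above, and `nouns_from_tagged_string` is a made-up helper name, not part of pattern.

```python
def nouns_from_tagged_string(tagged_string):
    """Return the words whose POS tag (second slash-separated field) starts with NN."""
    nouns = []
    for token in tagged_string.split():
        fields = token.split("/")
        # fields[0] is the word, fields[1] the POS tag; skip anything malformed.
        if len(fields) >= 2 and fields[1].startswith("NN"):
            nouns.append(fields[0])
    return nouns


# The tagged string is copied from the pattern output above.
output = ("I/PRP/B-NP/O eat/VBD/B-VP pizza/NN/B-NP/O with/IN/B-PP/B-PNP "
          "a/DT/B-NP/I-PNP fork/NN/I-NP/I-PNP ././O/O")
print(nouns_from_tagged_string(output))  # ['pizza', 'fork']
```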

An example from NLTK:

>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]

Once you know which tokens are substantives, i.e. nouns (their POS tags usually start with N), you can apply cloze deletions to them. Note that POS tagging isn't perfect, so accuracy will depend on how complete the text you're working on is. (I'm also assuming you're working in English, but there are POS taggers for many languages.)
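
As a rough sketch of that last step, the function below takes (word, tag) pairs in the shape `nltk.pos_tag` returns and wraps every noun in Anki's `{{c1::...}}` cloze syntax. The name `cloze_nouns` is made up for illustration, and giving each noun its own cloze number (so Anki produces one card per noun) is just one possible choice.

```python
def cloze_nouns(tagged_tokens):
    """Join (word, tag) pairs into one string, wrapping nouns in Anki cloze markers."""
    parts = []
    cloze_index = 0
    for word, tag in tagged_tokens:
        # Penn Treebank noun tags are NN, NNS, NNP and NNPS.
        if tag.startswith("NN"):
            cloze_index += 1
            parts.append("{{c%d::%s}}" % (cloze_index, word))
        else:
            parts.append(word)
    return " ".join(parts)


# Tagged tokens copied from the NLTK example above.
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
          ('completely', 'RB'), ('different', 'JJ')]
print(cloze_nouns(tagged))  # And now for {{c1::something}} completely different
```

Joining with single spaces is a simplification: punctuation tokens end up with a space in front of them, so real text would need a smarter way to put the sentence back together.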

Licensed under CC-BY-SA with attribution
