Question

I'm new to Python and struggle with data types and the conversions between them.

I have sentences in NLTK Tree format (obtained from the Stanford parser and converted to NLTK trees). I need to apply functions written for the output of an NLTK chunker. However, the parse tree format differs from the chunker output format. Both are NLTK trees, but their elements are structured differently (see below).

Could you please help me convert an NLTK parse tree to the NLTK chunker output format?

Thanks in advance!

Here is an NLTK Chunker output:

(S
  (NP Pierre/NNP Vinken/NNP)
  ,/,
  (NP 61/CD years/NNS old/JJ)
  ,/,
  will/MD
  join/VB
  (NP the/DT board/NN)
  as/IN
  (NP a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD)
  ./.)

Now printed element by element, with the type of each element:

class 'nltk.tree.Tree' (NP Pierre/NNP Vinken/NNP)
type 'tuple' (',', ',')
class 'nltk.tree.Tree' (NP 61/CD years/NNS old/JJ)
type 'tuple' (',', ',')
type 'tuple' ('will', 'MD')
type 'tuple' ('join', 'VB')
class 'nltk.tree.Tree' (NP the/DT board/NN)
type 'tuple' ('as', 'IN')
class 'nltk.tree.Tree' (NP a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD)
type 'tuple' ('.', '.')

Here is a "pure" NLTK Tree output (exactly as in the NLTK documentation):

(S
  (NP
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP (NNP Nov.) (CD 29))))
  (. .))

Now printed subtree by subtree, with the type of each element:

class 'nltk.tree.Tree' (NP
  (NP (NNP Pierre) (NNP Vinken))
  (, ,)
  (ADJP (NP (CD 61) (NNS years)) (JJ old))
  (, ,))
class 'nltk.tree.Tree' (NP (NNP Pierre) (NNP Vinken))
class 'nltk.tree.Tree' (NNP Pierre)
type 'str' Pierre
class 'nltk.tree.Tree' (NNP Vinken)
type 'str' Vinken
class 'nltk.tree.Tree' (, ,)
type 'str' ,
class 'nltk.tree.Tree' (ADJP (NP (CD 61) (NNS years)) (JJ old))
class 'nltk.tree.Tree' (NP (CD 61) (NNS years))
class 'nltk.tree.Tree' (CD 61)
type 'str' 61
class 'nltk.tree.Tree' (NNS years)
type 'str' years
class 'nltk.tree.Tree' (JJ old)
type 'str' old
class 'nltk.tree.Tree' (, ,)
type 'str' ,
class 'nltk.tree.Tree' (VP
  (MD will)
  (VP
    (VB join)
    (NP (DT the) (NN board))
    (PP (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
    (NP (NNP Nov.) (CD 29))))
class 'nltk.tree.Tree' (MD will)
type 'str' will
class 'nltk.tree.Tree' (VP
  (VB join)
  (NP (DT the) (NN board))
  (PP (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
  (NP (NNP Nov.) (CD 29)))
class 'nltk.tree.Tree' (VB join)
type 'str' join
class 'nltk.tree.Tree' (NP (DT the) (NN board))
class 'nltk.tree.Tree' (DT the)
type 'str' the
class 'nltk.tree.Tree' (NN board)
type 'str' board
class 'nltk.tree.Tree' (PP (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
class 'nltk.tree.Tree' (IN as)
type 'str' as
class 'nltk.tree.Tree' (NP (DT a) (JJ nonexecutive) (NN director))
class 'nltk.tree.Tree' (DT a)
type 'str' a
class 'nltk.tree.Tree' (JJ nonexecutive)
type 'str' nonexecutive
class 'nltk.tree.Tree' (NN director)
type 'str' director
class 'nltk.tree.Tree' (NP (NNP Nov.) (CD 29))
class 'nltk.tree.Tree' (NNP Nov.)
type 'str' Nov.
class 'nltk.tree.Tree' (CD 29)
type 'str' 29
class 'nltk.tree.Tree' (. .)
type 'str' .

Solution

Partial answer (i.e., no code):

NLTK represents chunked data using the Tree class, which is really designed for arbitrary syntactic trees. A chunked sentence is a tree with just one level of grouping, whose leaves are (word, tag) tuples rather than plain strings. To go from a full parse to a chunked structure, you therefore need to discard all but one kind of non-recursive group. Which groups? That depends on your application, since there are different kinds of "chunks" (e.g., named entities).
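To make the difference between the two shapes concrete, here is a small sketch (not part of the original answer) that assumes NLTK 3, where `Tree.fromstring`, `Tree.pos()` and `Tree.height()` are available. The chunk boundaries are picked by hand purely for illustration, and the variable names are made up.

```python
from nltk.tree import Tree

# A small full parse: every word sits under its own POS node, e.g. (NNP Pierre).
parse = Tree.fromstring(
    "(S (NP (NNP Pierre) (NNP Vinken))"
    " (VP (MD will) (VP (VB join) (NP (DT the) (NN board)))))"
)

# Tree.pos() pairs each word with the label of its POS node,
# which gives exactly the (word, tag) tuples a chunker emits as leaves.
tagged = parse.pos()

# Chunk-style tree built by hand: one level of NP groups, with the
# remaining (word, tag) tuples attached directly to the root.
chunked = Tree("S", [Tree("NP", tagged[:2]), *tagged[2:4], Tree("NP", tagged[4:])])

print(parse.height(), chunked.height())  # the parse is deeper than the chunk tree
print(chunked)
```

The leaves of `parse` are strings, while the leaves of `chunked` are tuples, which is why functions written for chunker output cannot consume the parse directly.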

Your example shows NP chunks, so you could walk the tree and omit all structure except for the top level of NP (or the lowest level, if you want to break up complex NP chunks into small ones).
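The answer itself gives no code, so the following is only one way such a walk might look. It assumes NLTK 3 (`Tree.label()`, `Tree.pos()`, `Tree.subtrees()`) and a well-formed parse in which every word sits under its own POS node. The names `is_preterminal`, `to_chunk_tree`, `chunk_label` and `lowest` are hypothetical helpers for this sketch, not part of NLTK.

```python
from nltk.tree import Tree


def is_preterminal(node):
    """True for a POS node such as (NNP Pierre): a Tree whose only child is a string."""
    return isinstance(node, Tree) and len(node) == 1 and isinstance(node[0], str)


def to_chunk_tree(parse, chunk_label="NP", lowest=False):
    """Flatten a full parse into a one-level, chunker-style tree.

    lowest=False keeps the topmost chunk_label nodes as chunks;
    lowest=True keeps only chunk_label nodes that contain no other one.
    """

    def contains_chunk(node):
        # Does any proper descendant of node carry chunk_label?
        return any(
            sub.label() == chunk_label
            for child in node
            if isinstance(child, Tree)
            for sub in child.subtrees()
        )

    def walk(node):
        if is_preterminal(node):
            # (NNP Pierre) becomes ('Pierre', 'NNP')
            return [(node[0], node.label())]
        if node.label() == chunk_label and not (lowest and contains_chunk(node)):
            # Keep this node as a chunk and flatten everything below it.
            return [Tree(chunk_label, node.pos())]
        # Any other phrase: drop the node itself, keep what its children yield.
        items = []
        for child in node:
            items.extend(walk(child))
        return items

    items = []
    for child in parse:
        items.extend(walk(child))
    return Tree(parse.label(), items)


parse = Tree.fromstring(
    "(S (NP (NP (NNP Pierre) (NNP Vinken)) (, ,)"
    " (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,))"
    " (VP (MD will) (VP (VB join) (NP (DT the) (NN board)))) (. .))"
)
print(to_chunk_tree(parse))               # top-level NP chunks
print(to_chunk_tree(parse, lowest=True))  # smallest NP chunks
```

Neither setting is guaranteed to reproduce a given chunker's output exactly. For instance, with `lowest=True` the word "old" ends up outside the "61 years" chunk, whereas the chunker output in the question groups all three words, so some post-processing may still be needed depending on the application.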

Licensed under: CC-BY-SA with attribution