Package nltk_lite :: Package parse :: Module treetransforms
[hide private]
[frames] | no frames]

Module treetransforms

source code


A collection of methods for tree (grammar) transformations used
in parsing natural language.  

Although many of these methods are technically grammar transformations
(ie. Chomsky Norm Form), when working with treebanks it is much more
natural to visualize these modifications in a tree structure.  Hence,
we will do all transformation directly to the tree itself.
Transforming the tree directly also allows us to do parent annotation.
A grammar can then be simply induced from the modified tree.

The following is a short tutorial on the available transformations.

1) Chomsky Normal Form (binarization)

  It is well known that any grammar has a Chomsky Normal Form (CNF)
  equivalent grammar where CNF is defined by every production having
  either two non-terminals or one terminal on its right hand side.
  When we have hierarchically structured data (ie. a treebank), it is
  natural to view this in terms of productions where the root of every
  subtree is the head (left hand side) of the production and all of
  its children are the right hand side constituents.  In order to
  convert a tree into CNF, we simply need to ensure that every subtree
  has either two subtrees as children (binarization), or one leaf node
  (non-terminal).  In order to binarize a subtree with more than two
  children, we must introduce artificial nodes.

  There are two popular methods to convert a tree into CNF: left
  factoring and right factoring.  The following example demonstrates
  the difference between them.

  |          Original       Right-Factored     Left-Factored
  |
  |  Example:     A              A                      A 
  |             / | \          /   \                  /     |            B  C  D   ==>  B    A|<C-D>   OR   A|<B-C>  D 
  |                                 /  \          /    |                                C    D        B    C

2) Parent Annotation

  In addition to binarizing the tree, there are two standard
  modifications to node labels we can do in the same traversal: parent
  annotation and Markov order-N smoothing (or sibling smoothing).

  The purpose of parent annotation is to refine the probabilities of
  productions by adding a small amount of context.  With this simple
  addition, a CYK (inside-outside, dynamic programming chart parse)
  can improve from 74% to 79% accuracy.  A natural generalization from
  parent annotation is to grandparent annotation and beyond.  The
  tradeoff becomes accuracy gain vs. computational complexity.  We
  must also keep in mind data sparcity issues.

  |          Original       Parent Annotation 
  |
  |  Example:     A                A^<?>            
  |             / | \             /   \            
  |            B  C  D   ==>  B^<A>    A|<C-D>^<?>     where ? is the parent of A
  |                                      /  \        
  |                                  C^<A>   D^<A>   


3) Markov order-N smoothing

  Markov smoothing combats data sparcity issues as well as decreasing
  computational requirements by limiting the number of children
  included in artificial nodes.  In practice, most people use an order
  2 grammar.

  |             Original          No Smoothing        Markov order 1     Markov order 2   etc...
  |
  |  Example:       A                A                       A                A 
  |            /  / | \  \         /   \                   /   \            /     |           B  C  D  E  F  ==>  B    A|<C-D-E-F>   ==>  B   A|<C>  ==>   B  A|<C-D>
  |                                      /   \                /   \            /     |                                     C    ...             C    ...         C    ...

  Annotation decisions can be thought about in the vertical direction
  (parent, grandparent, etc) and the horizontal direction (number of
  siblings to keep).  Parameters to the following functions specify
  these values.  For more information see:

  Dan Klein and Chris Manning (2003) "Accurate Unlexicalized Parsing", ACL-03.
  http://www.aclweb.org/anthology/P03-1054
    
4) Unary Collapsing  

  Collapse unary productions (ie. subtrees with a single child) into a
  new non-terminal (Tree node).  This is useful when working with
  algorithms that do not allow unary productions, yet you do not wish
  to lose the parent information.

  |              A         
  |              |
  |  Example:    B   ==>   A+B
  |             / \        /   |            C   D      C   D    

Functions [hide private]
 
chomskyNormalForm(tree, factor='right', horzMarkov=None, vertMarkov=0, childChar='|', parentChar='^')
This method can modify a tree in three ways: 1.
source code
 
unChomskyNormalForm(tree, expandUnary=True, childChar='|', parentChar='^', unaryChar='+')
This method modifies the tree in three ways:
source code
 
collapseUnary(tree, collapsePOS=True, collapseRoot=True, joinChar='+')
Collapse subtrees with a single child (ie.
source code
 
toTreebank(tree)
Convert a tree into its treebank-style bracketed equivalent.
source code
 
_toTreebank(tree) source code
 
demo()
A demonstration showing how each tree transform can be used.
source code
Function Details [hide private]

chomskyNormalForm(tree, factor='right', horzMarkov=None, vertMarkov=0, childChar='|', parentChar='^')

source code 

This method can modify a tree in three ways:
1. Convert a tree into its Chomsky Normal Form (CNF) equivalent -- Every subtree
has either two non-terminals or one terminal as its children.  This process
requires the creation of more "artificial" non-terminal nodes.
2. Markov (vertical) smoothing of children in new artificial nodes
3. Horizontal (parent) annotation of nodes
(see documentation in code for more information)
  
@param tree: The Tree to be modified
@type  tree: C{Tree}
@param factor: Right or left factoring method (default = "right")
@type  factor: C{string} = [left|right]
@param horzMarkov: Markov order for sibling smoothing in artificial nodes (None (default) = include all siblings)
@type  horzMarkov: C{int} | None
@param vertMarkov: Markov order for parent smoothing (0 (default) = no vertical annotation)
@type  vertMarkov: C{int} | None
@param childChar: A string used in construction of the artificial nodes, separating the head of the
                  original subtree from the child nodes that have yet to be expanded (default = "|")
@type  childChar: C{string}
@param parentChar: A string used to separate the node representation from its vertical annotation
@type  parentChar: C{string}

unChomskyNormalForm(tree, expandUnary=True, childChar='|', parentChar='^', unaryChar='+')

source code 

This method modifies the tree in three ways:

  1. Transforms a tree in Chomsky Normal Form back to its original structure (branching greater than two)
  2. Removes any parent annotation (if it exists)
  3. (optional) expands unary subtrees (if previously collapsed with collapseUnary(...) )
Parameters:
  • tree (Tree) - The Tree to be modified
  • expandUnary (boolean) - Flag to expand unary or not (default = True)
  • childChar (string) - A string separating the head node from its children in an artificial node (default = "|")
  • parentChar (string) - A sting separating the node label from its parent annotation (default = "^")
  • unaryChar (string) - A string joining two non-terminals in a unary production (default = "+")

collapseUnary(tree, collapsePOS=True, collapseRoot=True, joinChar='+')

source code 

Collapse subtrees with a single child (ie. unary productions) into a new non-terminal (Tree node) joined by 'joinChar'. This is useful when working with algorithms that do not allow unary productions, and completely removing the unary productions would require loss of useful information. The Tree is modified directly (since it is passed by reference) and no value is returned.

Parameters:
  • tree (Tree) - The Tree to be collapsed
  • collapsePOS (boolean) - 'False' (default) will not collapse the parent of leaf nodes (ie. Part-of-Speech tags) since they are always unary productions
  • collapseRoot (boolean) - 'False' (default) will not modify the root production if it is unary. For the Penn WSJ treebank corpus, this corresponds to the TOP -> productions.
  • joinChar (string) - A string used to connect collapsed node values (default = "+")