Package nltk_lite :: Package contrib :: Module concord :: Class IndexConcordance

Class IndexConcordance

object --+
         |
        IndexConcordance

Class that generates concordances from a list of sentences.

Uses an index for efficiency. If a SentencesIndex object is provided, it will be used, otherwise one will be constructed from the list of sentences. When generating a concordance, the supplied regular expression is used to filter the list of words in the index. Any that match are looked up in the index, and their lists of (sentence number, word number) pairs are used to extract the correct amount of context from the sentences.

Although this class also allows regular expressions to be specified for the left and right context, they are not used on the index. If only left/right regexps are provided, the class will essentially generate a concordance for every word in the corpus, then filter it with the regexps. This is inefficient and requires very large amounts of memory.
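The index-driven lookup described above can be sketched in a few lines. This is a standalone illustration, not the SentencesIndex or IndexConcordance implementation: the names build_index and concordance are hypothetical, and it assumes the index is keyed on lowercased word/POS tokens and that the regexp is anchored with re.match.

```python
import re
from collections import defaultdict

def build_index(sentences):
    """Map each lowercased 'word/pos' token to its (sentence number, word number) pairs.

    Hypothetical helper; the real SentencesIndex may be organised differently.
    """
    index = defaultdict(list)
    for sent_num, sentence in enumerate(sentences):
        for word_num, (word, pos) in enumerate(sentence):
            index[f"{word}/{pos}".lower()].append((sent_num, word_num))
    return index

def concordance(sentences, index, middle_regexp=".*", left=3, right=3):
    """Filter the index keys with the regexp, then extract word-level context per hit."""
    pattern = re.compile(middle_regexp)
    results = []
    for key in index:
        if not pattern.match(key):  # assumption: match at the start of the token
            continue
        for sent_num, word_num in index[key]:
            sentence = sentences[sent_num]
            left_context = sentence[max(0, word_num - left):word_num]
            right_context = sentence[word_num + 1:word_num + 1 + right]
            results.append((left_context, sentence[word_num], right_context, sent_num))
    return results

sentences = [
    [("The", "at"), ("dog", "nn"), ("must", "md"), ("bark", "vb")],
    [("You", "ppss"), ("must", "md"), ("go", "vb")],
]
index = build_index(sentences)
hits = concordance(sentences, index, middle_regexp=r"^must/.*", left=1, right=1)
for hit in hits:
    print(hit)
```

Only the index keys that match the target regexp are visited, which is why filtering on the target word is cheap while filtering on the left or right context alone is not.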

Instance Methods

__init__(self, sentences, index=None)
Constructor.

formatted(self, leftRegexp=None, middleRegexp='.*', rightRegexp=None, leftContextLength=3, rightContextLength=3, contextInSentences=True, contextChars=50, maxKeyLength=0, showWord=True, sort=0, showPOS=True, flipWordAndPOS=True, verbose=True)
Generates and displays keyword-in-context formatted concordance data.

raw(self, leftRegexp=None, middleRegexp='.*', rightRegexp=None, leftContextLength=3, rightContextLength=3, contextInSentences=True, sort=0, verbose=True)
Generates and returns raw concordance data.
Returns: list

format(self, source, contextChars=55, maxKeyLength=0, showWord=True, showPOS=True, flipWordAndPOS=True, verbose=True)
Formats raw concordance output produced by raw().

_matches(self, item, leftRe, rightRe)
Private method that runs the given regexps over a raw concordance item and returns whether they match it.

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Class Variables
  SORT_WORD = 0
Constant for sorting by target word.
  SORT_POS = 1
Constant for sorting by target word's POS tag.
  SORT_NUM = 2
Constant for sorting by sentence number.
  SORT_RIGHT_CONTEXT = 3
Constant for sorting by the first word of the right context.

Properties

Inherited from object: __class__

Method Details

__init__(self, sentences, index=None)
(Constructor)

Constructor.

Parameters:
  • sentences (list) - List of sentences to create a concordance for. Sentences should be lists of (string, string) pairs.
  • index (SentencesIndex) - SentencesIndex object to use as an index. If this is not provided, one will be generated.
Overrides: object.__init__

formatted(self, leftRegexp=None, middleRegexp='.*', rightRegexp=None, leftContextLength=3, rightContextLength=3, contextInSentences=True, contextChars=50, maxKeyLength=0, showWord=True, sort=0, showPOS=True, flipWordAndPOS=True, verbose=True)

Generates and displays keyword-in-context formatted concordance data.

This is a convenience method that combines the options of raw() and format(). Unless you need raw output, this is probably the most useful method.

Parameters:
  • leftRegexp (string) - Regular expression applied to the left context to filter output. Defaults to None.
  • middleRegexp (string) - Regular expression applied to the target word to filter output. Defaults to ".*" (i.e. everything).
  • rightRegexp (string) - Regular expression applied to the right context to filter output. Defaults to None.
  • leftContextLength (number) - Length of left context. Defaults to 3.
  • rightContextLength (number) - Length of right context. Defaults to 3.
  • contextInSentences (boolean) - Determines whether the context length arguments are counted in words or in sentences. If false, they are counted in words: a rightContextLength of 2 gives two words of right context. If true, a rightContextLength of 2 gives a right context made up of the portion of the target word's sentence to the right of the target, plus the two sentences that follow that sentence. Defaults to False.
  • contextChars (number) - Amount of context to show. If set to less than 0, the amount of context shown is not limited (may look ugly). Defaults to 50.
  • maxKeyLength (number) - Max number of characters to show for the target word. If 0 or less, this value is calculated so as to fully show all target words. Defaults to 0.
  • showWord (boolean) - Whether to show words. Defaults to True.
  • sort (integer) - Should be set to one of the provided SORT constants. If SORT_WORD, the output is sorted on the target word. If SORT_POS, the output is sorted on the target word's POS tag. If SORT_NUM, the output is sorted by sentence number. If SORT_RIGHT_CONTEXT, the output is sorted on the first word of the right context. Defaults to SORT_WORD.
  • showPOS (boolean) - Whether to show POS tags. Defaults to True.
  • flipWordAndPOS (boolean) - If true, displays POS tags before words instead of after them (i.e. prints 'cc/and' instead of 'and/cc'). Defaults to False.
  • verbose (boolean) - Displays some extra status information. Defaults to False.
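The two meanings of the context length arguments can be illustrated with a small standalone sketch. The function names left_context and right_context are hypothetical and the code does not reproduce the class's internals; in particular, it assumes that word-level context stops at the sentence boundary.

```python
def right_context(sentences, sent_num, word_num, length, context_in_sentences):
    """Return the right context of the token at (sent_num, word_num)."""
    sentence = sentences[sent_num]
    if not context_in_sentences:
        # Word mode: the next `length` tokens (assumed not to cross the sentence end).
        return sentence[word_num + 1:word_num + 1 + length]
    # Sentence mode: rest of the target's sentence plus the next `length` sentences.
    context = list(sentence[word_num + 1:])
    for following in sentences[sent_num + 1:sent_num + 1 + length]:
        context.extend(following)
    return context

def left_context(sentences, sent_num, word_num, length, context_in_sentences):
    """Mirror image of right_context for the left-hand side."""
    sentence = sentences[sent_num]
    if not context_in_sentences:
        return sentence[max(0, word_num - length):word_num]
    context = []
    for preceding in sentences[max(0, sent_num - length):sent_num]:
        context.extend(preceding)
    context.extend(sentence[:word_num])
    return context

sentences = [
    [("One", "cd"), ("two", "cd")],
    [("Dogs", "nns"), ("must", "md"), ("bark", "vb"), ("loudly", "rb")],
    [("Cats", "nns"), ("sleep", "vb")],
    [("Birds", "nns"), ("sing", "vb")],
]

# Target is 'must' at sentence 1, word 1.
words_mode = right_context(sentences, 1, 1, 2, context_in_sentences=False)
sentences_mode = right_context(sentences, 1, 1, 1, context_in_sentences=True)
print([w for w, _ in words_mode])
print([w for w, _ in sentences_mode])
```

Sentence mode can return far more tokens than word mode for the same length value, which is worth keeping in mind when contextChars is also limiting the displayed width.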

raw(self, leftRegexp=None, middleRegexp='.*', rightRegexp=None, leftContextLength=3, rightContextLength=3, contextInSentences=True, sort=0, verbose=True)

Generates and returns raw concordance data.

Regular expressions supplied are evaluated over the appropriate part of each line of the concordance. For the purposes of evaluating the regexps, the lists of (word, POS tag) tuples are flattened into a space-separated list of word/POS tokens (i.e. the word followed by '/' followed by the POS tag). A regexp like '^must/.*' matches the word 'must' with any POS tag, while one like '.*/nn$' matches any word with a POS tag of 'nn'. All regexps are evaluated over lowercase versions of the text.
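As a rough illustration of the flattening step, the sketch below joins (word, POS tag) tuples into lowercase word/POS tokens and tests regexps against the result. The helper names flatten and matches are hypothetical, and the use of re.search is an assumption; the class's own _matches method may differ.

```python
import re

def flatten(tokens):
    """Join (word, POS tag) tuples into a lowercase, space-separated 'word/pos' string."""
    return " ".join(f"{word}/{pos}" for word, pos in tokens).lower()

def matches(regexp, tokens):
    """Return True if regexp is None or is found in the flattened tokens."""
    if regexp is None:
        return True
    return re.search(regexp, flatten(tokens)) is not None

target = [("Must", "MD")]
right = [("be", "be"), ("Done", "vbn")]

print(flatten(target))
print(flatten(right))
print(matches(r"^must/.*", target))
print(matches(r".*/nn$", target))
```

Because the text is lowercased before matching, regexps should be written in lowercase, including the POS tag portion.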

Parameters:
  • leftRegexp (string) - Regular expression applied to the left context to filter output. Defaults to None.
  • middleRegexp (string) - Regular expression applied to the target word to filter output. Defaults to ".*" (i.e. everything).
  • rightRegexp (string) - Regular expression applied to the right context to filter output. Defaults to None.
  • leftContextLength (number) - Length of left context. Defaults to 3.
  • rightContextLength (number) - Length of right context. Defaults to 3.
  • contextInSentences (boolean) - Determines whether the context length arguments are counted in words or in sentences. If false, they are counted in words: a rightContextLength of 2 gives two words of right context. If true, a rightContextLength of 2 gives a right context made up of the portion of the target word's sentence to the right of the target, plus the two sentences that follow that sentence. Defaults to False.
  • sort (integer) - Should be set to one of the provided SORT constants. If SORT_WORD, the output is sorted on the target word. If SORT_POS, the output is sorted on the target word's POS tag. If SORT_NUM, the output is sorted by sentence number. If SORT_RIGHT_CONTEXT, the output is sorted on the first word of the right context. Defaults to SORT_WORD.
  • verbose (boolean) - Displays some extra status information. Defaults to False.
Returns: list
Raw concordance output. Returned as a list of ([left context], target word, [right context], target word sentence number) tuples.
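The effect of the four SORT constants on tuples of this shape can be sketched as follows. The constant values are the ones listed under Class Variables; everything else is an assumption made for illustration: sort_items is a hypothetical helper, the target word is taken to be a (word, POS tag) pair, and comparison is done case-insensitively.

```python
SORT_WORD = 0
SORT_POS = 1
SORT_NUM = 2
SORT_RIGHT_CONTEXT = 3

def sort_items(items, sort=SORT_WORD):
    """Sort ([left], target, [right], sentence number) tuples by the chosen key."""
    keys = {
        SORT_WORD: lambda item: item[1][0].lower(),
        SORT_POS: lambda item: item[1][1].lower(),
        SORT_NUM: lambda item: item[3],
        # An empty right context sorts first (assumption).
        SORT_RIGHT_CONTEXT: lambda item: item[2][0][0].lower() if item[2] else "",
    }
    return sorted(items, key=keys[sort])

items = [
    ([("the", "at")], ("run", "vb"), [("fast", "rb")], 2),
    ([("a", "at")], ("dog", "nn"), [("barks", "vbz")], 0),
    ([], ("must", "md"), [("go", "vb")], 1),
]

for sort in (SORT_WORD, SORT_POS, SORT_NUM, SORT_RIGHT_CONTEXT):
    print(sort, [item[3] for item in sort_items(items, sort)])
```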

format(self, source, contextChars=55, maxKeyLength=0, showWord=True, showPOS=True, flipWordAndPOS=True, verbose=True)

Formats raw concordance output produced by raw().

Displays a concordance in keyword-in-context style format.

Parameters:
  • source (list) - Raw concordance output to format. Expects a list of ([left context], target word, [right context], target word sentence number) tuples.
  • contextChars (number) - Amount of context to show. If set to less than 0, does not limit amount of context shown (may look ugly). Defaults to 55.
  • maxKeyLength (number) - Max number of characters to show for the target word. If 0 or less, this value is calculated so as to fully show all target words. Defaults to 0.
  • showWord (boolean) - Whether to show words. Defaults to True.
  • showPOS (boolean) - Whether to show POS tags. Defaults to True.
  • flipWordAndPOS (boolean) - If true, displays POS tags before words instead of after them (i.e. prints 'cc/and' instead of 'and/cc'). Defaults to False.
  • verbose (boolean) - Displays some extra status information. Defaults to False.
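A minimal sketch of keyword-in-context formatting along the lines of these parameters is shown below. It is not the class's format() method: render and format_kwic are hypothetical names, contextChars is assumed to count characters, and the column layout (right-aligned left context, padded key) is an assumption.

```python
def render(tokens, show_word=True, show_pos=True, flip=False):
    """Render (word, POS tag) tuples as 'word/pos' tokens, optionally flipped."""
    parts = []
    for word, pos in tokens:
        fields = []
        if show_word:
            fields.append(word)
        if show_pos:
            fields.append(pos)
        if flip:
            fields.reverse()
        parts.append("/".join(fields))
    return " ".join(parts)

def format_kwic(items, context_chars=55, max_key_length=0,
                show_word=True, show_pos=True, flip=False):
    """Return one keyword-in-context line per raw concordance tuple."""
    rows = [
        (render(left, show_word, show_pos, flip),
         render([target], show_word, show_pos, flip),
         render(right, show_word, show_pos, flip))
        for left, target, right, _sent_num in items
    ]
    if max_key_length <= 0:
        # Wide enough to show every target word in full.
        max_key_length = max((len(key) for _, key, _ in rows), default=0)
    lines = []
    for left_text, key, right_text in rows:
        if context_chars >= 0:
            # Keep the characters nearest the key on each side.
            left_text = left_text[max(0, len(left_text) - context_chars):]
            left_text = left_text.rjust(context_chars)
            right_text = right_text[:context_chars]
        key = key[:max_key_length].ljust(max_key_length)
        lines.append(f"{left_text} {key} {right_text}")
    return lines

items = [
    ([("the", "at"), ("dog", "nn")], ("must", "md"), [("bark", "vb")], 0),
    ([("you", "ppss")], ("go", "vb"), [], 1),
]
lines = format_kwic(items, context_chars=10)
for line in lines:
    print(line)
```

Padding the left context to a fixed width keeps the target words aligned in a single column, which is what makes keyword-in-context output easy to scan.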