Source Code for Module Martel.LAX

"""A simple way to read lists of fields from flat XML records.

Many XML formats are very simple: all the fields are needed, there is
no tree hierarchy, all the text inside of the tags is used, and the
text is short (it can easily fit inside of memory).  SAX is pretty
good for this but it's still somewhat complicated to use.  DOM is
designed to handle tree structures so is a bit too much for a simple
flat data structure.

This module implements a new, simpler API, which I'll call LAX.  It
only works well when the elements are small and non-hierarchical.  LAX
has three callbacks.

  start() -- the first method called

  element(tag, attrs, text, startpos, endpos) -- called once for each
    element, after the element has been fully read.  (I.e., called when
    the endElement would be called.)  The 'tag' is the element name,
    the attrs is the attribute object that would be used in a
    startElement, and the text is all the text between the two tags.
    The text is the concatenation of all the characters() calls.  The
    startpos and endpos are the character offsets, counted over all
    characters() calls in the document, at which the element's text
    starts and ends.

  end() -- the last method called (unless there was an error)

LAX.LAX is a content handler which converts the SAX events to
LAX events.  Here is an example use:

  >>> from Martel import Word, Whitespace, Group, Integer, Rep1, AnyEol
  >>> format = Rep1(Group("line", Word("name") + Whitespace() +
  ...                             Integer("age")) + AnyEol())
  >>> parser = format.make_parser()
  >>>
  >>> from Martel import LAX
  >>> class PrintFields(LAX.LAX):
  ...     def element(self, tag, attrs, text, startpos, endpos):
  ...         print tag, "has", repr(text)
  ...
  >>> parser.setContentHandler(PrintFields())
  >>> text = "Maggie 3\nPorter 1\n"
  >>> parser.parseString(text)
  name has 'Maggie'
  age has '3'
  line has 'Maggie 3'
  name has 'Porter'
  age has '1'
  line has 'Porter 1'
  >>>

Callbacks take some getting used to.  Many people prefer an iterative
solution which returns all of the fields of a given record at one
time.  The default implementation of LAX.LAX helps this case.
The handler is itself a dictionary, which the 'start' method clears.
When the 'element' method is called, the information is added to the
dictionary; the key is the element name and the value is the list of
text strings.  It's a list because the same field name may occur
multiple times.

If you need the element attributes as well as the name, use the
LAX.LAXAttrs class, which stores a list of 2-tuples (text, attrs)
instead of just the text.  LAX.LAXPositions stores ElementInfo objects
instead, which also record the start and end positions.

For example:

  >>> iterator = format.make_iterator("line")
  >>> for record in iterator.iterateString(text, LAX.LAX()):
  ...     print record["name"][0], "is", record["age"][0]
  ...
  Maggie is 3
  Porter is 1
  >>>

If you only want a few fields, you can pass the list to the
constructor, as in:

  >>> lax = LAX.LAX(["name", "sequence"])
  >>>

"""

import string
from xml.sax import handler

# Used to simplify the check if

class _IsIn:
    def __contains__(self, obj):
        return 1

class LAX(handler.ContentHandler, dict):
    def __init__(self, fields = None):
        handler.ContentHandler.__init__(self)
        dict.__init__(self)
        if fields is None:
            fields = _IsIn()
        self.__fields = fields

    def __getattr__(self, name):
        if name == "document":
            return self
        raise AttributeError(name)

    def uses_tags(self):
        if isinstance(self.__fields, _IsIn):
            return None
        return self.__fields


    def startDocument(self):
        self.__capture = []
        self.__expect = None
        self.__pos = 0
        self.start()

    def start(self):
        self.clear()

    def startElement(self, tag, attrs):
        if tag in self.__fields:
            self.__capture.append( (tag, attrs, [], self.__pos) )
            self.__expect = tag

    def characters(self, s):
        self.__pos += len(s)
        for term in self.__capture:
            term[2].append(s)

    def endElement(self, tag):
        if tag == self.__expect:
            cap, attrs, text_items, start = self.__capture.pop()
            self.element(tag, attrs, string.join(text_items, ""),
                         start, self.__pos)
            if self.__capture:
                self.__expect = self.__capture[-1][0]
            else:
                self.__expect = None

    def element(self, tag, attrs, text, startpos, endpos):
        self.setdefault(tag, []).append(text)

    def endDocument(self):
        if self.__capture:
            missing = []
            for term in self.__capture:
                missing.append(term[0])
            raise TypeError("Looking for endElements for %s" % \
                            string.join(missing, ","))
        self.end()

    def end(self):
        pass


# Also stores the attributes
class LAXAttrs(LAX):
    def element(self, tag, attrs, text, startpos, endpos):
        self.setdefault(tag, []).append( (text, attrs) )

# Stores attributes and positions
class ElementInfo:
    def __init__(self, text, attrs, startpos, endpos):
        self.text = text
        self.attrs = attrs
        self.startpos = startpos
        self.endpos = endpos

class LAXPositions(LAXAttrs):
    def element(self, tag, attrs, text, startpos, endpos):
        self.setdefault(tag, []).append(
            ElementInfo(text, attrs, startpos, endpos) )
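
The way the handler above folds startElement/characters/endElement
events into one element() call per tag can be tried without Martel.
What follows is only a rough stand-alone sketch, written for Python 3
and the standard xml.sax parser; the class name MiniLAX and the sample
XML are made up for illustration and are not part of Martel.  It
checks the top of the capture stack directly instead of keeping a
separate "expected tag" variable, which should behave the same way for
well-formed input.

```python
import xml.sax
from xml.sax import handler


class MiniLAX(handler.ContentHandler, dict):
    """Hypothetical stand-alone sketch of the LAX idea (not Martel's class)."""

    def __init__(self, fields=None):
        handler.ContentHandler.__init__(self)
        dict.__init__(self)
        self._fields = fields   # None means "capture every tag"
        self._capture = []      # stack of (tag, attrs, text_pieces, startpos)
        self._pos = 0           # characters seen so far in the document

    def startDocument(self):
        self._capture = []
        self._pos = 0
        self.clear()

    def startElement(self, tag, attrs):
        if self._fields is None or tag in self._fields:
            self._capture.append((tag, attrs, [], self._pos))

    def characters(self, s):
        self._pos += len(s)
        # Every element that is still open receives this piece of text.
        for term in self._capture:
            term[2].append(s)

    def endElement(self, tag):
        # Only fire element() for tags that were actually captured.
        if self._capture and self._capture[-1][0] == tag:
            _, attrs, pieces, start = self._capture.pop()
            self.element(tag, attrs, "".join(pieces), start, self._pos)

    def element(self, tag, attrs, text, startpos, endpos):
        # Default behaviour: collect the text of each tag in a list.
        self.setdefault(tag, []).append(text)


# Made-up sample record; only "name" and "age" are requested.
xml_text = b"<record><name>Maggie</name><age>3</age></record>"
lax = MiniLAX(["name", "age"])
xml.sax.parseString(xml_text, lax)
print(lax["name"][0], "is", lax["age"][0])
```

Because the handler is its own result dictionary, the caller reads the
fields straight off the handler once parsing finishes.  Overriding
element() in a subclass is the place to store attributes or positions
as well, in the manner of LAXAttrs and LAXPositions.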