GiPSy Scanner
GiPSy is an extensible general-purpose text scanner and parser. The base scanner is intended to be subclassed and extended to provide the desired functionality.
For an example that implements a basic Python source code scanner, refer to the Python source code parser (also linked below).
The steps to extend the scanner are:
- Define any new desired Token subclasses;
- Create a subclass of the Parser class, and:
  - Override the __init__() method, and:
    - Call the superclass __init__() method;
    - Assign pairs of container symbols (e.g. parentheses, curly braces, square brackets) to self._containers as a list of two-element tuples, the elements representing the opening and closing container symbols (currently, only one-character symbols are supported). Containers are useful for implementing parsers such as numeric calculators, where the order of evaluation is changed through the use of parentheses;
    - Append TokenMatch objects, each wrapping a regular expression that identifies the desired tokens, to the self._token_matches list. Initialize each TokenMatch object with two arguments: a compiled regular expression, and a list of Token class objects, one for each group in the regular expression (list only one Token class object when the entire regular expression is to be matched). The TokenMatch objects and their associated regular expressions are processed in the order in which they are appended to self._token_matches; and
  - Override the _get_token() method, if necessary, when particular tokens need further processing beyond simple regular expression matching, prior to assignment.
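The steps above can be sketched roughly as follows. The GiPSy import path and exact method signatures are not shown on this page, so the Token, TokenMatch, and Parser classes below are simplified stand-ins written only for this illustration, not the library's actual code. Number, Operator, and CalcParser are hypothetical names, and the _get_token() signature is an assumption.

```python
import re

# --- Stand-ins (assumptions): minimal imitations of the GiPSy base classes,
# --- written only so this sketch runs on its own. The real classes differ.
class Token:
    def __init__(self, text):
        self.text = text

class TokenMatch:
    def __init__(self, regex, token_classes):
        self.regex = regex                  # compiled regular expression
        self.token_classes = token_classes  # list of Token classes

class Parser:
    def __init__(self):
        self._containers = []     # (open, close) pairs; unused by this stand-in
        self._token_matches = []  # tried in the order they were appended
        self._tlist = []          # resulting list of tokens

    def tokenize(self, text):
        pos = 0
        while pos < len(text):
            for tm in self._token_matches:
                m = tm.regex.match(text, pos)
                if m and m.end() > pos:
                    self._tlist.append(self._get_token(tm, m))
                    pos = m.end()
                    break
            else:
                # No pattern matched: keep the character as a plain Token.
                self._tlist.append(Token(text[pos]))
                pos += 1

    def _get_token(self, token_match, match):
        # Hypothetical signature; handles whole-expression matches only.
        return token_match.token_classes[0](match.group(0))

    def read(self):
        # Default behaviour described in the text: original, unmodified text.
        return "".join(t.text for t in self._tlist)

# --- Step 1: define new Token subclasses (hypothetical names).
class Number(Token):
    pass

class Operator(Token):
    pass

# --- Step 2: subclass Parser and configure it in __init__().
class CalcParser(Parser):
    def __init__(self):
        super().__init__()
        # One-character container symbols as two-element tuples.
        self._containers = [("(", ")"), ("[", "]")]
        # Each TokenMatch: a compiled regex plus a one-element class list,
        # because the entire expression is matched.
        self._token_matches.append(
            TokenMatch(re.compile(r"\d+(?:\.\d+)?"), [Number]))
        self._token_matches.append(
            TokenMatch(re.compile(r"[-+*/]"), [Operator]))

parser = CalcParser()
parser.tokenize("(1 + 2.5) * 3")
output = parser.read()
kinds = [type(t).__name__ for t in parser._tlist if type(t) is not Token]
print(output)
print(kinds)
```

In this stand-in, characters that no pattern matches (spaces and parentheses here) are kept as plain Token objects so that read() can reproduce the input; how the real scanner treats them may differ.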
And you're done! Create an instance of your new class, feed it the text to be parsed through the tokenize() method, and obtain the output from the read() method. The read() method takes three optional arguments:
- html: set to True to HTML-escape the outputted token;
- decorated: set to True to surround the outputted token with specified text. See the py2html class for an example of overriding the _set_token_decoration() method to add HTML <span> decorators to Python constructs, producing marked-up source code for pretty-printing; and
- tree: set to True to output the parsed tokens in a tree-style view.
By default, the read() method outputs the original, unmodified text.
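To make the html and decorated options more concrete, here is a small sketch of the idea. SketchParser is a stand-in written for this illustration and is not the GiPSy Parser: its whitespace tokenizer, the (prefix, suffix) return value of _set_token_decoration(), and the KeywordMarker subclass are all assumptions, and the tree option is left out.

```python
import html as html_lib

class SketchParser:
    """Stand-in (assumption): not the real GiPSy Parser."""

    def __init__(self):
        self._tlist = []  # tokens kept as plain strings in this sketch

    def tokenize(self, text):
        # Naive whitespace split, for illustration only.
        self._tlist = text.split(" ")

    def _set_token_decoration(self, token):
        # Hypothetical signature: return (prefix, suffix) for a token.
        return ("", "")

    def read(self, html=False, decorated=False):
        out = []
        for tok in self._tlist:
            # html=True escapes characters such as < and &.
            text = html_lib.escape(tok) if html else tok
            # decorated=True wraps the token in the chosen text.
            if decorated:
                before, after = self._set_token_decoration(tok)
                text = before + text + after
            out.append(text)
        return " ".join(out)

class KeywordMarker(SketchParser):
    def _set_token_decoration(self, token):
        # Wrap one keyword in a <span>; leave everything else undecorated.
        if token == "if":
            return ('<span class="kw">', "</span>")
        return ("", "")

p = KeywordMarker()
p.tokenize("if a < b")
plain = p.read()                             # original text, unmodified
marked = p.read(html=True, decorated=True)   # escaped and decorated
print(plain)
print(marked)
```

The decoration is applied after escaping in this sketch, so the added span tags themselves are not escaped.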
View the py2html application for an example of using the Py2HTMLParser class in a real-life setting.
You can also access the self._tlist attribute directly in your subclass to work with the actual list of tokens and provide additional functionality on top of decorated printing.
Source Code and Downloads
View the GiPSy scanner source code:
Python distutils distributions available from the Python Package Index:
Regular compressed archives available locally: