HTMLFilter is a module for Python
programs. It parses an HTML 4 document, allowing subclasses to
pass through or modify text and tags as the event stream
goes by, and write out a copy that will be an
otherwise exact replica of the original, including whitespace
and comments. Minor errors in the markup will pass right
through without causing indigestion, and ASP, PHP, JSP, and
other server-side code will
generally survive the round trip. (The only exception is
server-side code embedded inside an HTML tag you’re actually
modifying, not just passing through.)
Uses range from something as simple as adding a <meta> tag to an
existing web page to something as complex as merging two HTML pages
(as ShearerSite does when it intelligently merges content pages into
template pages).
You can also use it to generate HTML from scratch, with HTMLFilter
taking care of the attribute encoding for tags.
HTMLFilter has also been used as the engine for an HTTP-proxy-like
CGI that fixed markup errors and updated links
in another vendor’s web application,
and in WebCarbon to fill a Web form with the user’s
entered values.
Tags are parsed lazily, for efficiency in the common case where the
program is only interested in passing through a tag, not reading or
modifying attributes.
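The idea behind lazy tag parsing can be sketched as follows. This is a minimal illustration only, not HTMLFilter's actual code: the LazyTag class, its html() method, the parse_count counter, and the attribute regular expression are all hypothetical names chosen for this sketch, and it is written for current Python 3 rather than the older interpreters the module targets.

```python
import re

# Hypothetical attribute pattern: name, optionally followed by =value
# (double-quoted, single-quoted, or unquoted).
ATTR = re.compile(r"""([^\s=/>]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?""")


class LazyTag:
    """Keeps the raw tag text and parses attributes only on first access."""

    def __init__(self, raw):
        self.raw = raw
        self._attrs = None      # filled in by the first attribute lookup
        self.parse_count = 0    # illustration only: counts real parses

    def html(self):
        # Pass-through path: no parsing is needed to echo the tag.
        return self.raw

    def _attributes(self):
        if self._attrs is None:
            self.parse_count += 1
            inner = self.raw[1:-1]              # drop '<' and '>'
            parts = inner.split(None, 1)        # tag name, then the rest
            rest = parts[1] if len(parts) > 1 else ""
            attrs = {}
            for match in ATTR.finditer(rest):
                key = match.group(1).lower()
                value = match.group(2)
                if value is None:
                    value = key                 # boolean attribute
                elif value[0] in "\"'":
                    value = value[1:-1]         # strip surrounding quotes
                attrs[key] = value
            self._attrs = attrs
        return self._attrs

    def __getitem__(self, name):
        return self._attributes().get(name.lower())


tag = LazyTag('<a href="x.html" CLASS=big selected>')
print(tag.html())      # echoes the raw text without parsing
print(tag["href"])     # first lookup triggers the one-time parse
```

A filter that only copies tags to its output never pays for the regular-expression pass; the cost is incurred once, and only for tags whose attributes are actually inspected.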
Documentation
HTMLFilter is intended to be subclassed, and subclasses can output an exact replica of the original or modify
specific elements or attributes.
Normally, a user would instantiate such a subclass, then call feedString(originalHTML), then call close().
The subclass would override the handleXXX methods to perform the
filtering, and override collectHTML() if it wanted to store the generated
data. Subclasses that only wanted to read the file and not output
a modified version wouldn't need to override collectHTML().
The handleXXX methods are overridden through subclassing the main
HTMLFilter class, rather than implementing some kind of HTMLHandler
interface, so that new handleXXX methods can be added to this base class
with default implementations that provide backwards compatibility.
(+++: could split off HTMLHandler class if it were used as a base class.)
Data flow through HTMLFilter methods:
    feedString(originalHTML)
      -> multiple calls to handle[Text|Tag|Script|Comment|...](tag...)
         (subclasses will override to observe or modify the HTML code)
      -> collectHTML(html)
         (subclasses can store the pieces of the final HTML code)
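The flow above can be mimicked with a small stand-in class. The sketch below does not import HTMLFilter; MiniFilter is a hypothetical toy base class with a naive regex tokenizer, and its handler signatures (raw strings passed to handleText and handleTag) are assumptions made for illustration. Only the method names feedString, close, handleXXX, and collectHTML come from the description above. It is written for current Python 3.

```python
import re


class MiniFilter:
    """Toy stand-in that mimics the described call flow (not HTMLFilter)."""

    _TAG = re.compile(r"<[^<>]*>")   # naive: anything between < and >

    def feedString(self, html):
        pos = 0
        for match in self._TAG.finditer(html):
            if match.start() > pos:
                self.handleText(html[pos:match.start()])
            self.handleTag(match.group(0))
            pos = match.end()
        if pos < len(html):
            self.handleText(html[pos:])

    def close(self):
        pass   # nothing is buffered in this toy version

    # Default handlers pass everything through unchanged.
    def handleText(self, text):
        self.collectHTML(text)

    def handleTag(self, tag):
        self.collectHTML(tag)

    def collectHTML(self, html):
        pass   # subclasses that want the output override this


class TitleShouter(MiniFilter):
    """Upper-cases text inside <title>; everything else is copied as-is."""

    def __init__(self):
        self.pieces = []
        self.in_title = False

    def handleTag(self, tag):
        lowered = tag.lower()
        if lowered.startswith("<title"):
            self.in_title = True
        elif lowered.startswith("</title"):
            self.in_title = False
        self.collectHTML(tag)

    def handleText(self, text):
        self.collectHTML(text.upper() if self.in_title else text)

    def collectHTML(self, html):
        self.pieces.append(html)


shouter = TitleShouter()
shouter.feedString("<html><title>hi there</title>\n<p>Body  text</p></html>")
shouter.close()
print("".join(shouter.pieces))
```

Because every handler ends in a call to collectHTML(), the whitespace and untouched tags come out exactly as they went in; only the text the subclass chose to change differs.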
HTMLFilter has partial support for server-side scripting tags (ASP, PHP,
JSP): they work anywhere an HTML tag would work, but HTML tags with
embedded code may not be parseable. For instance, if a tag
contains ASP code inside an attribute value, subclasses can only
reliably pass the whole tag through unmodified, not read or modify
the attributes.
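One way to picture this limitation is a tokenizer that treats server-side blocks as opaque chunks. The following is a rough sketch under stated assumptions, not HTMLFilter's parser: the TOKEN pattern, the split_markup() function, and the piece kinds ("server", "comment", "tag", "text") are hypothetical and written for current Python 3.

```python
import re

# Hypothetical token pattern: server-side blocks are tried first,
# then comments, then ordinary tags.
TOKEN = re.compile(r"<%.*?%>|<\?.*?\?>|<!--.*?-->|<[^<>]*>", re.DOTALL)


def split_markup(source):
    """Split source into (kind, text) pieces that rejoin to the original."""
    pieces = []
    pos = 0
    for match in TOKEN.finditer(source):
        if match.start() > pos:
            pieces.append(("text", source[pos:match.start()]))
        chunk = match.group(0)
        if chunk.startswith("<%") or chunk.startswith("<?"):
            kind = "server"
        elif chunk.startswith("<!--"):
            kind = "comment"
        else:
            kind = "tag"
        pieces.append((kind, chunk))
        pos = match.end()
    if pos < len(source):
        pieces.append(("text", source[pos:]))
    return pieces


# Server-side code where a tag could stand: each block is one opaque piece.
page = "<p><% if x %>Hi<?php echo 1; ?></p>"
print(split_markup(page))

# ASP code inside an attribute value: the surrounding <a ...> is no longer
# recognized as a single tag, so its attributes cannot be read, but the
# pieces still rejoin to the original text.
embedded = '<a href="<%= url %>">'
print(split_markup(embedded))
```

In this sketch the round trip survives in both cases, which matches the behavior described above: embedded code costs the ability to inspect the enclosing tag, not the ability to copy it.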
HTMLFilter does not support SGML short tag forms (which aren't normally
used or parsed in HTML anyway, and which the HTML RFC warns about).
If a subclass doesn't override a handleXXX method, the default
implementation passes the data to collectHTML() so that the original
HTML code is preserved. New handleXXX methods added in the future will
therefore be backwards compatible with older subclasses, so that file
filters never lose text.
HTMLFilter has been successfully tested with versions of Python
ranging from 1.5.2 to 2.3. It’s Unicode-savvy; the source encoding can
be set, and HTML decoding respects Unicode entities.
It is distributed under a Python license.
Example (test script for HTMLTag objects)
>>> import HTMLFilter
>>> tag = HTMLFilter.HTMLFilter.HTMLTag('option')
>>> tag.getHTML()
'<option>'
>>> tag['value'] = '"This & that"'
>>> tag.getHTML()
'<option value="&quot;This &amp; that&quot;">'
>>> tag['value']
'"This & that"'
>>> tag.setBooleanAttribute('selected', 1)
>>> tag.getHTML()
'<option value="&quot;This &amp; that&quot;" selected>'
>>> tag['selected']
'selected'
>>> tag.setBooleanAttribute('Selected', 0)
>>> tag.getHTML()
'<option value="&quot;This &amp; that&quot;">'
>>> print tag['selected']
None
>>> tag.setBooleanAttribute('selected', 1)
>>> tag.getHTML()
'<option value="&quot;This &amp; that&quot;" selected>'
>>> del tag['VaLUE']
>>> tag.getHTML()
'<option selected>'
>>> HTMLFilter.HTMLDecode('&#8221;') == unichr(8221)
True
>>> HTMLFilter.HTMLDecode('one”two”')
'one”two”'
Download
[View/Download HTMLFilter] Version 1.1; 36K