Wednesday, November 08, 2006
I Want Your SAX; JPA ResourceManager
onmouseover or onclick in their tags) is a bit trickier still. Getting HTML from end users who may not be terribly familiar with HTML constructs and don't always write the cleanest HTML is even more of a challenge.
My rules for HTML in articles are pretty simple. You must enclose paragraphs with the <p> tag. The allowed tags are p, a, em, strong and blockquote. The only one of these that can have attributes is the <a> tag, and the only attribute I'll allow is href. It's fairly restrictive, but articles are mostly just paragraphs with the occasional link anyway. Anything else might interfere with the flow or style of the page.
To deal with the malformed HTML question, I turned to JTidy, as I have in the past. JTidy is good at dealing with many common HTML coding problems, like dropped end tags, inlines that span blocks, and other common errors. It is not good at dealing with dropped angle brackets and dropped quote marks. For my purposes it is normally adequate. I wrote a small wrapper class around Tidy which also allows users to leave out the paragraph tags and leave blank lines to denote paragraphs... paragraph tags are inserted on their behalf. Additionally, it sets up some options:
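tidy.setMakeClean(true);        // replace presentational markup (font, center, ...) with cleaner equivalents
tidy.setWord2000(true);         // strip Microsoft Word 2000 specific markup
tidy.setLogicalEmphasis(true);  // turn <i> into <em> and <b> into <strong>
tidy.setDocType("strict");      // emit a strict DOCTYPE
tidy.setShowWarnings(false);    // don't report warnings
tidy.setQuiet(true);            // suppress the summary chatter
tidy.setXHTML(true);            // write the output as XHTML
tidy.setDropEmptyParas(true);   // discard empty <p> elements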
There's also a "demoronize" method to change instances of <br /><br /> to paragraphs.
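To make the two text-level conveniences concrete, here is a minimal sketch of how blank-line paragraphs and the double-break rewrite could be handled before the text reaches Tidy. This is not the actual wrapper code: the class name ParagraphPrep, the method wrapParagraphs and both regular expressions are assumptions made for illustration, and only the name demoronize comes from the description above.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-processing helpers; a naive sketch, not the real wrapper. */
class ParagraphPrep {
    // Two consecutive <br>, <br/> or <br /> tags, optionally separated by whitespace.
    private static final Pattern DOUBLE_BR =
            Pattern.compile("(?i)<br\\s*/?>\\s*<br\\s*/?>");
    // One or more blank lines (Unix or Windows line endings).
    private static final Pattern BLANK_LINE = Pattern.compile("\\r?\\n\\s*\\r?\\n");

    /** Turns a double line break inside a paragraph into a paragraph boundary. */
    static String demoronize(String html) {
        return DOUBLE_BR.matcher(html).replaceAll("</p><p>");
    }

    /** Wraps each blank-line-separated chunk in <p>...</p> unless it already starts with <p>. */
    static String wrapParagraphs(String text) {
        StringBuilder out = new StringBuilder();
        for (String chunk : BLANK_LINE.split(text.trim())) {
            String para = chunk.trim();
            if (para.isEmpty()) {
                continue;
            }
            if (para.toLowerCase().startsWith("<p>")) {
                out.append(para);               // the author already supplied the tag
            } else {
                out.append("<p>").append(para).append("</p>");
            }
        }
        return out.toString();
    }
}
```

The sketch only recognises a chunk that begins with a literal <p>; anything else, including a chunk that starts with a blockquote, gets wrapped, so a real wrapper would need a smarter check.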
Unfortunately, Tidy does not remove 100% of the cruft; it can let attributes and tags with namespaces sneak through when you're outputting a document in XHTML or XML, which we need for proper page parsing. Tidy also won't remove arbitrary tags or attributes. A second phase of cleaning is needed.
For my second phase, I decided to try SAX. I have worked with DOM before, but the lightweight nature of SAX appealed to me for this part of the project. SAX can operate on a stream and you never need to have the entire XML document in memory. The way one commonly interacts with the SAX parser is to extend the DefaultHandler class, overriding the methods that the parser calls when particular events occur during the parsing of a document. Typically you'd implement startDocument, startElement, endElement, characters, ignorableWhitespace and possibly endDocument.
SAX doesn't really give you a way to alter the document it's parsing (to do so would generally require holding the document in memory, something for which you'd want to look at DOM). If you need to construct a new document, that's up to you. It does provide another mechanism for coding filters which can be chained, and that chain could certainly end with an all-purpose document-writing filter, but writing that filter is again left up to you.
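As a small illustration of the event model, the following sketch extends DefaultHandler and simply records the order in which the parser fires its callbacks. The class name EventLogger and its trace helper are made up for this example; the parser setup uses the standard JAXP SAXParserFactory.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that logs SAX events instead of acting on them. */
class EventLogger extends DefaultHandler {
    private final StringBuilder log = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        log.append("start:").append(qName).append(' ');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        log.append("end:").append(qName).append(' ');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one run of text in several chunks.
        log.append("text:").append(new String(ch, start, length)).append(' ');
    }

    /** Parses the given well-formed XML and returns the recorded event trace. */
    static String trace(String xml) {
        EventLogger handler = new EventLogger();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.log.toString().trim();
    }
}
```

Calling EventLogger.trace("<p><br/></p>") returns "start:p start:br end:br end:p": nothing is held in memory beyond what the handler chooses to keep, and an empty element produces a start event followed immediately by an end event.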
I decided to take the simplest approach possible and created a class called AllowedTag, which holds a tag name and the list of associated attributes I'll allow on that tag. I didn't go to the effort of differentiating between allowed and required attributes or attributes which might be mutually exclusive, leaving that work up to JTidy. I just want to strip the tags which offend me.
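A rough sketch of how such an AllowedTag class and a whitelist handler built on it might fit together is below; it follows the steps walked through in the rest of this post, but it is not the original code. The class name TagCleanser, the field names, the clean helper and the error message wording are all assumptions. Re-escaping text and attribute values on the way out is also an addition of this sketch: SAX hands over already-decoded characters, so something has to escape them again for the output to stay well-formed.

```java
import java.io.StringReader;
import java.util.*;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** A tag name plus the attributes permitted on it. */
class AllowedTag {
    final String name;
    final Set<String> attributes;

    AllowedTag(String name, String... attributes) {
        this.name = name;
        this.attributes = new HashSet<>(Arrays.asList(attributes));
    }
}

/** Hypothetical handler that rebuilds the document from whitelisted tags only. */
class TagCleanser extends DefaultHandler {
    private final Map<String, AllowedTag> allowed = new HashMap<>();
    private final StringBuilder out = new StringBuilder();
    private final List<String> errors = new ArrayList<>();
    private String lastTag;       // most recently written start tag
    private int lastTagEnd = -1;  // buffer length right after that tag was written

    TagCleanser(AllowedTag... tags) {
        for (AllowedTag t : tags) allowed.put(t.name, t);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        AllowedTag tag = allowed.get(qName);
        if (tag == null) {
            errors.add("Tag not allowed: " + qName);
            return;
        }
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            String attr = atts.getQName(i);
            if (tag.attributes.contains(attr)) {
                out.append(' ').append(attr).append("=\"")
                   .append(escape(atts.getValue(i))).append('"');
            } else {
                errors.add("Attribute not allowed: " + attr + " on " + qName);
            }
        }
        out.append('>');
        lastTag = qName;
        lastTagEnd = out.length();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!allowed.containsKey(qName)) return;  // already reported in startElement
        if (qName.equals(lastTag) && out.length() == lastTagEnd) {
            out.setLength(out.length() - 1);      // nothing written since the start tag:
            out.append(" />");                    // turn "<br>" into "<br />"
        } else {
            out.append("</").append(qName).append('>');
        }
        lastTag = null;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(escape(new String(ch, start, length)));  // escaping is this sketch's addition
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    List<String> errors() { return errors; }

    /** Runs the handler over well-formed input and returns the rebuilt markup. */
    static String clean(String xhtml, TagCleanser handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Input was not well-formed", e);
        }
        return handler.out.toString();
    }
}
```

Note that only the tags themselves are dropped; any text inside a disallowed tag still reaches the output, which matches the "strip the tags" goal but means the content of something like a script element would survive as plain text.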
As SAX reads the end user's HTML, it makes calls to my implementation of startElement, passing in the namespace URI, the local tag name, the qualified (raw) tag name, and a set of attributes in an Attributes object.
Then I check the tag name against my map of AllowedTags. If it's allowed, I append a tag opener to my buffer and step through the Attributes, adding those which are allowed by the AllowedTag object. For any tag or attribute which is not allowed, I record an error and increment an error count, and leave it out of the new document.
Characters and ignorable whitespace are simply appended to the buffer. For end tags, I perform the same check as in the start tag method, with one additional gotcha: self-closing tags like <br /> fire startElement and immediately fire endElement. In order to preserve these tags as-is without changing them to something like <br></br>, it is necessary to track the last tag seen and the position in the output buffer after that tag was inserted. Then when handling an endElement one can check whether the buffer position is unchanged and the tag name is the same; if so, just change the last character (which will be >) to />.
Normally when operating with SAX and trying to do anything more complicated than this, one would be pushing elements onto a stack with startElement and characters and then popping data with endElement. In this case what we're doing is so simple that a stack of depth > 1 is not necessary (the self-closing trick is functionally equivalent to a depth 1 stack).
After finishing these classes (JUnit test-driven, of course) I wired them into a Wicket validator, grossly abusing the validation framework by again letting it rewrite the data model if errors were found and changes were necessary (like illegal tags or attributes used) and by calling error(String) on the form's article body component to insert the very specific error messages generated by the Tidy wrapper and the SAX tag cleanser.
I also found a flaw last night in my plan to wrap the EntityManagerFactory in a singleton ResourceManager. I'm not completely foiled, but some additional work may be needed. It can happen that an EntityManagerFactory fails to initialize properly or becomes invalid due to database connectivity problems. When this happens, that EMF refuses to cooperate any further; you have to destroy it and create a new one. So we can't blindly hold this in a singleton without doing some checks to ensure that it's still valid and functioning. The thing to do might be to move the createEntityManager call to the ResourceManager and include the option to destroy and reconnect with some number of retries in that part of the code.
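One possible shape for that destroy-and-retry logic is sketched below. To keep it runnable without a JPA provider on the classpath, the factory type is a generic parameter and the build, close and create steps are passed in as functions; all of those names are assumptions. With JPA the intent would be to pass something like Persistence.createEntityManagerFactory for the builder, EntityManagerFactory::close for the closer and EntityManagerFactory::createEntityManager for the creator. The singleton plumbing is left out.

```java
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical holder that rebuilds its factory when creating from it fails. */
class ResourceManager<F> {
    private final Supplier<F> factoryBuilder;
    private final Consumer<F> factoryCloser;
    private final int maxRetries;
    private F factory;  // built lazily, discarded whenever it misbehaves

    ResourceManager(Supplier<F> factoryBuilder, Consumer<F> factoryCloser, int maxRetries) {
        this.factoryBuilder = factoryBuilder;
        this.factoryCloser = factoryCloser;
        this.maxRetries = maxRetries;
    }

    /** Tries to create something from the factory, rebuilding the factory after each failure. */
    synchronized <R> R create(Function<F, R> creator) {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                if (factory == null) {
                    factory = factoryBuilder.get();
                }
                return creator.apply(factory);
            } catch (RuntimeException e) {
                last = e;
                discardFactory();  // a broken factory is never reused
            }
        }
        throw new IllegalStateException(
                "Gave up after " + (maxRetries + 1) + " attempts", last);
    }

    private void discardFactory() {
        F broken = factory;
        factory = null;
        if (broken != null) {
            try {
                factoryCloser.accept(broken);
            } catch (RuntimeException ignored) {
                // closing an already-broken factory may itself fail; nothing more to do
            }
        }
    }
}
```

This version treats any runtime exception as a reason to rebuild, which is probably too blunt for production use; a real implementation would likely distinguish connectivity failures from programming errors and pause between attempts.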