Jan 2, 2014

Parsing HTML like XML as a DOM tree? Use HtmlCleaner

I needed to parse HTML emails in Java, but their markup was not well-formed: tags were unbalanced, among other problems.

I wanted the same DOM tree features that are available when parsing XML, so I searched for a tool and found it
in HtmlCleaner.

HtmlCleaner converts the HTML code into a DOM tree, which can then be traversed and searched with XPath.
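
As a rough illustration of that conversion, here is a minimal sketch. It assumes the HtmlCleaner 2.x jar is on the classpath; the class name HtmlToDom, the helper names, and the sample markup are made up for this example and are not taken from my original code.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;

public class HtmlToDom {

    // Clean possibly unbalanced HTML and convert it to a standard W3C DOM document.
    public static Document toDom(String html) throws ParserConfigurationException {
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(html);              // repairs the tag structure
        CleanerProperties props = cleaner.getProperties();
        return new DomSerializer(props).createDOM(root); // TagNode tree -> org.w3c.dom.Document
    }

    // Evaluate an XPath expression against the document and return the result as a string.
    public static String evaluate(Document doc, String expression)
            throws XPathExpressionException {
        return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input: the <b> tag is never closed and </html> is missing.
        String brokenHtml = "<html><body><p>Hello <b>world</p></body>";
        Document doc = toDom(brokenHtml);
        System.out.println(evaluate(doc, "//b"));
    }
}
```

Once the document is a regular org.w3c.dom.Document, the standard javax.xml.xpath API from the JDK can be used on it, just as with parsed XML.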

This is a nice way to extract data from an email message. In my case all the data was in the same kind of tag, but the signature used a different font attribute, which made it possible to tell the two apart in the XPath query.
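
The attribute-based distinction might look roughly like the sketch below. It uses only the JDK and starts from markup that is already well-formed, standing in for what a cleaner would produce. The font tags, the face values, and the text are hypothetical; the real email used its own tag and attribute values.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SignatureXPath {

    // Hypothetical cleaned email body: content and signature share the same tag,
    // only the "face" attribute differs.
    public static final String CLEANED =
            "<html><body>"
            + "<font face=\"Arial\">Order 4711 shipped.</font>"
            + "<font face=\"Arial\">Tracking: ABC123</font>"
            + "<font face=\"Courier New\">John Doe, Example Corp</font>"
            + "</body></html>";

    // Parse a well-formed markup string into a DOM document.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Return the text content of every node matched by the XPath expression.
    public static List<String> select(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(CLEANED);
        // Message data: font elements with the body font.
        System.out.println(select(doc, "//font[@face='Arial']"));
        // Signature: font elements with any other face value.
        System.out.println(select(doc, "//font[@face!='Arial']"));
    }
}
```

The attribute predicate in square brackets is what separates the signature from the rest, even though both live in the same element type.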