Exciting things with XML
So XML exists, and you probably have to deal with it occasionally.
It’s a bit more complex than a CSV or a fixed-width file, so an enormous number of people have created an even greater number of toolkits to help you read and write the things.
What the industry standard appears to be is to create a toolkit that implements half of the XML standard, and then drop that and start another which has a different, but overlapping subset of supported features.
If you’re a member of the W3C, you probably want to have several different XML implementations, with a whole raft of SAXTransformerFactory and DOMImplementationRegistry classes just to work out which implementation to use, none of which really work, but that’s the sort of thing that keeps committees busy.
Why not use XmlUtil? I know I certainly do. Especially in that example from a few weeks ago.
It has the following methods, which I will explain briefly in the bit below this bit.
getCleanXml
getText
getTextPreserveElements
getTextNonRecursive
getXmlString
toDocument
compact
processContentHandler
static inner class SimpleTableContentHandler
static inner class AbstractStackContentHandler
static inner class NodeListIterator
getCleanXml
This method runs a document through tagsoup to get an ‘XML-clean’ document.
A fairly common use-case for me is having a fairly horrible looking bit of HTML taken from the wastelands of the internet, and wanting to extract some data from it.
The tagsoup library does a fairly decent job of cleaning up things that look like HTML so that it starts to approach something that more closely resembles XML.
It’s a useful first step before performing XPath-style data extraction, or before attempting to modify a webpage to pass the W3C validator. Which is the sort of thing that SEO companies think gives you google brownie points.
I have my own views on search-engine optimisation, which are that:
- google knows what you’re searching for
- google knows when you stop searching for it
- google probably knows therefore whether a website is a good result for a particular search query or not
which, you’ll notice, doesn’t involve the phrase ‘meta tags’ anywhere in it.
You can send my $20,000 fee to the usual account.
On a slightly more conspiratorial note,
- google is aware of your IP address
- google is therefore more likely to show you your own ‘ads’ that you’ve purchased on its network
- which is therefore more likely to get you to throw more money at google seeing as it’s doing such a good job getting your extra-specially-important site at the top of those search results
That was a bit of a longer side-rant than I thought it would be. I might move this bit into a separate blog entry. Maybe. Later.
So anyway, this is how you’d use this getCleanXml thing:

    // uses the Scanner trick from http://weblogs.java.net/blog/pat/archive/2004/10/stupid_scanner.html
    String rubbishHtml = new Scanner(new URL("http://www.microsoft.com").openStream(), "UTF-8").useDelimiter("\\A").next();
    String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
    Document d = XmlUtil.toDocument(cleanXml);
    Element titleEl = (Element) XPathAPI.selectSingleNode(d, "/html/head/title");
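If Xalan’s XPathAPI isn’t on your classpath, the same sort of lookup can be done with the javax.xml.xpath classes that ship with the JDK. This is only a rough sketch: it assumes the input is already well-formed (i.e. it has been through something like getCleanXml) and has no namespaces, and the XPathSketch class and its evaluate method are names made up for illustration, not part of XmlUtil.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of XmlUtil
class XPathSketch {

    // parse a well-formed XML string and return the string value of an XPath expression
    static String evaluate(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            // checked parser/XPath exceptions are wrapped to keep the sketch short
            throw new IllegalStateException(e);
        }
    }
}
```

Calling XPathSketch.evaluate(cleanXml, "/html/head/title") would then hand back the title text directly, skipping the Element step.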
getText
getText() returns the text between the opening and closing tags of an element in an XML document.
So extending the example above:
    // uses the Scanner trick from http://weblogs.java.net/blog/pat/archive/2004/10/stupid_scanner.html
    String rubbishHtml = new Scanner(new URL("http://www.microsoft.com").openStream(), "UTF-8").useDelimiter("\\A").next();
    String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
    Document d = XmlUtil.toDocument(cleanXml);
    Element titleEl = (Element) XPathAPI.selectSingleNode(d, "/html/head/title");
    String title = XmlUtil.getText(titleEl);
    System.out.println(title);
currently generates
Microsoft Australia | Devices and Services
with nary a regex in sight.
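For the curious, here’s roughly what a recursive text extraction like this looks like using nothing but the JDK’s DOM classes. It’s a sketch of the general technique, not the XmlUtil source; TextSketch, root and text are names invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration; not the XmlUtil source
class TextSketch {

    // parse a well-formed XML string and return its root element
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // concatenate every text node below 'node', in document order
    static String text(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            short type = c.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(c.getNodeValue());
            } else if (type == Node.ELEMENT_NODE) {
                sb.append(text(c)); // recurse into child elements
            }
        }
        return sb.toString();
    }
}
```

The standard Node.getTextContent() method does much the same job; writing the loop out by hand mostly makes it easier to see where the variations in the next couple of sections come from.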
getTextPreserveElements
Same as getText(), but allows a predefined list of tags to be returned as well.
This is useful when processing formatted text in which you might want to keep certain tags (B, I, IMG etc).

    String input = "<p>Here is some formatted text <b>in bold</b>, <i>in italics</i>," +
        " <blink>blinking</blink>, and with an image at the end <img src=\"frog.png\"/></p>";
    Document d = XmlUtil.toDocument(input);
    Element paraEl = d.getDocumentElement();
    System.out.println(XmlUtil.getText(paraEl));
    System.out.println(XmlUtil.getTextPreserveElements(paraEl, new String[] { "b", "i", "img" } ));
which would give you something like
    Here is some formatted text in bold, in italics, blinking, and with an image at the end
    Here is some formatted text <b>in bold</b>, <i>in italics</i>, blinking, and with an image at the end <img src="frog.png"></img>
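One way to get this behaviour is to walk the tree as before, but write the start and end tags back out whenever the element name is on the keep-list. The following is a sketch under that assumption rather than the real XmlUtil code; the PreserveSketch names are invented, and it doesn’t bother escaping attribute values or text, which a real implementation would want to do.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration; not the XmlUtil source
class PreserveSketch {

    // parse a well-formed XML string and return its root element
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // text of 'node', keeping the tags (and attributes) of any element named in 'keep'
    static String text(Node node, Set<String> keep) {
        StringBuilder sb = new StringBuilder();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                sb.append(c.getNodeValue());
            } else if (c.getNodeType() == Node.ELEMENT_NODE) {
                boolean kept = keep.contains(c.getNodeName());
                if (kept) {
                    sb.append('<').append(c.getNodeName());
                    NamedNodeMap attrs = c.getAttributes();
                    for (int i = 0; i < attrs.getLength(); i++) {
                        Node a = attrs.item(i);
                        // NOTE: no escaping of attribute values in this sketch
                        sb.append(' ').append(a.getNodeName())
                          .append("=\"").append(a.getNodeValue()).append('"');
                    }
                    sb.append('>');
                }
                sb.append(text(c, keep)); // contents are always included
                if (kept) {
                    sb.append("</").append(c.getNodeName()).append('>');
                }
            }
        }
        return sb.toString();
    }
}
```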
getTextNonRecursive
Same as getText(), but doesn’t attempt to recurse into child elements within the parent XML element.

    String input = "<p>This is the bit we're interested in <span>but not this bit</span></p>";
    Document d = XmlUtil.toDocument(input);
    Element paraEl = d.getDocumentElement();
    System.out.println(XmlUtil.getText(paraEl));
    System.out.println(XmlUtil.getTextNonRecursive(paraEl));

which generates:

    This is the bit we're interested in but not this bit
    This is the bit we're interested in
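The non-recursive flavour is the simplest of the lot: only look at the text nodes that are direct children, and skip everything else. Again, this is a sketch with made-up names (DirectTextSketch), not the XmlUtil source, and whether the real method trims the result is something I haven’t assumed either way.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration; not the XmlUtil source
class DirectTextSketch {

    // parse a well-formed XML string and return its root element
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // concatenate only the text nodes that are direct children of 'node'
    static String directText(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                sb.append(c.getNodeValue()); // child elements are skipped entirely
            }
        }
        return sb.toString();
    }
}
```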
getXmlString
Converts a Document into a String. As in the following example, which attempts to extract the ‘keywords’ and ‘description’ meta tag elements from google’s home page, and send them to stdout.
    String rubbishHtml = new Scanner(new URL("http://www.google.com").openStream(), "UTF-8").useDelimiter("\\A").next();
    String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
    Document d = XmlUtil.toDocument(cleanXml);
    NodeList nodes = XPathAPI.selectNodeList(d, "/html/head/meta");
    for (int i=0; i<nodes.getLength(); i++) {
        System.out.println(i + ": " + XmlUtil.getXmlString(nodes.item(i), true));
    }
which currently, and not too surprisingly, generates:
    0: <meta content="/images/google_favicon_128.png" itemprop="image"/>
It’s amazing that anyone can find them, really.
toDocument
Converts a String into a W3C Document object.
Most of the examples above use this method, so hopefully you get the idea.
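If you want to see the String-to-Document-and-back round trip without any helper library, the JDK’s DocumentBuilder and an identity Transformer will do it. This is a sketch only: RoundTripSketch and its method names are invented here, and the boolean on toXmlString (omit the XML declaration or not) is my own guess at a useful flag, not a statement about what the second argument of getXmlString means.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration; not the XmlUtil source
class RoundTripSketch {

    // String -> W3C Document
    static Document toDocument(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Node (Document or Element) -> String, via an identity transform
    static String toXmlString(Node node, boolean omitDeclaration) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, omitDeclaration ? "yes" : "no");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```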
compact
Removes whitespace surrounding text nodes contained within an Element.
Such as this attempt to generate the Cliff’s notes for All’s Well That Ends Well:

    // from http://www.ibiblio.org/xml/examples/shakespeare/all_well.xml
    // (the ampersand in "&c." has to be escaped as &amp; or the parser will reject the input)
    String input = "<PERSONAE>\n" +
        " <TITLE>Dramatis Personae</TITLE>\n" +
        " \n" +
        " <PERSONA>KING OF FRANCE</PERSONA>\n" +
        " <PERSONA>DUKE OF FLORENCE</PERSONA>\n" +
        " <PERSONA>BERTRAM, Count of Rousillon.</PERSONA>\n" +
        " <PERSONA>LAFEU, an old lord.</PERSONA>\n" +
        " <PERSONA>PAROLLES, a follower of Bertram.</PERSONA>\n" +
        " \n" +
        " <PGROUP>\n" +
        " <PERSONA>Steward</PERSONA>\n" +
        " <PERSONA>Clown</PERSONA>\n" +
        " <GRPDESCR>servants to the Countess of Rousillon.</GRPDESCR>\n" +
        " </PGROUP>\n" +
        " \n" +
        " <PERSONA>A Page. </PERSONA>\n" +
        " <PERSONA>COUNTESS OF ROUSILLON, mother to Bertram. </PERSONA>\n" +
        " <PERSONA>HELENA, a gentlewoman protected by the Countess.</PERSONA>\n" +
        " <PERSONA>An old Widow of Florence. </PERSONA>\n" +
        " <PERSONA>DIANA, daughter to the Widow.</PERSONA>\n" +
        " \n" +
        " <PGROUP>\n" +
        " <PERSONA>VIOLENTA</PERSONA>\n" +
        " <PERSONA>MARIANA</PERSONA>\n" +
        " <GRPDESCR>neighbours and friends to the Widow.</GRPDESCR>\n" +
        " </PGROUP>\n" +
        " \n" +
        " <PERSONA>Lords, Officers, Soldiers, &amp;c., French and Florentine.</PERSONA>\n" +
        "</PERSONAE>\n";
    Document d = XmlUtil.toDocument(input);
    System.out.println("Before compact():");
    System.out.println(XmlUtil.getXmlString(d, true));
    XmlUtil.compact(d.getDocumentElement());
    System.out.println();
    System.out.println("And after compact(), the much more readable:");
    System.out.println(XmlUtil.getXmlString(d, true));
which generates:
Before compact():
<PERSONAE>
<TITLE>Dramatis Personae</TITLE>
<PERSONA>KING OF FRANCE</PERSONA>
<PERSONA>DUKE OF FLORENCE</PERSONA>
<PERSONA>BERTRAM, Count of Rousillon.</PERSONA>
<PERSONA>LAFEU, an old lord.</PERSONA>
<PERSONA>PAROLLES, a follower of Bertram.</PERSONA>
<PGROUP>
<PERSONA>Steward</PERSONA>
<PERSONA>Clown</PERSONA>
<GRPDESCR>servants to the Countess of Rousillon.</GRPDESCR>
</PGROUP>
<PERSONA>A Page. </PERSONA>
<PERSONA>COUNTESS OF ROUSILLON, mother to Bertram. </PERSONA>
<PERSONA>HELENA, a gentlewoman protected by the Countess.</PERSONA>
<PERSONA>An old Widow of Florence. </PERSONA>
<PERSONA>DIANA, daughter to the Widow.</PERSONA>
<PGROUP>
<PERSONA>VIOLENTA</PERSONA>
<PERSONA>MARIANA</PERSONA>
<GRPDESCR>neighbours and friends to the Widow.</GRPDESCR>
</PGROUP>
<PERSONA>Lords, Officers, Soldiers, &amp;c., French and Florentine.</PERSONA>
</PERSONAE>
And after compact(), the much more readable:
<PERSONAE><TITLE>Dramatis Personae</TITLE><PERSONA>KING OF FRANCE</PERSONA><PERSONA>DUKE OF FLORENCE</PERSONA><PERSONA>BERTRAM, Count of Rousillon.</PERSONA><PERSONA>LAFEU, an old lord.</PERSONA><PERSONA>PAROLLES, a follower of Bertram.</PERSONA><PGROUP><PERSONA>Steward</PERSONA><PERSONA>Clown</PERSONA><GRPDESCR>servants to the Countess of Rousillon.</GRPDESCR></PGROUP><PERSONA>A Page.</PERSONA><PERSONA>COUNTESS OF ROUSILLON, mother to Bertram.</PERSONA><PERSONA>HELENA, a gentlewoman protected by the Countess.</PERSONA><PERSONA>An old Widow of Florence.</PERSONA><PERSONA>DIANA, daughter to the Widow.</PERSONA><PGROUP><PERSONA>VIOLENTA</PERSONA><PERSONA>MARIANA</PERSONA><GRPDESCR>neighbours and friends to the Widow.</GRPDESCR></PGROUP><PERSONA>Lords, Officers, Soldiers, &amp;c., French and Florentine.</PERSONA></PERSONAE>
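The whitespace-stripping itself can be sketched in a few lines of DOM walking: trim each text node, and throw away the ones that end up empty. As with the other sketches this is my own illustration of the idea (CompactSketch is a made-up name), not a copy of what XmlUtil.compact does internally.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration; not the XmlUtil source
class CompactSketch {

    // parse a well-formed XML string and return its root element
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // trim every text node under 'node' in place, removing whitespace-only ones
    static void compact(Node node) {
        Node c = node.getFirstChild();
        while (c != null) {
            Node next = c.getNextSibling(); // grab this before possibly removing c
            if (c.getNodeType() == Node.TEXT_NODE) {
                String trimmed = c.getNodeValue().trim();
                if (trimmed.isEmpty()) {
                    node.removeChild(c);
                } else {
                    c.setNodeValue(trimmed);
                }
            } else if (c.getNodeType() == Node.ELEMENT_NODE) {
                compact(c);
            }
            c = next;
        }
    }
}
```

Note that this is lossy for mixed content: in something like "some <b>bold</b> text" the spaces either side of the bold element would vanish too, so it’s best kept for data-style XML like the cast list above.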
processContentHandler
Runs a SAX ContentHandler over a Document.
You need a ContentHandler in order to process it, so why not keep reading to see an example of one of those?
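In plain JDK terms, running a ContentHandler over some XML means getting hold of an XMLReader, plugging the handler in, and calling parse. Here’s a minimal sketch of that plumbing, taking the XML as a String; SaxRunSketch, run and elementNames are names invented for this example and say nothing about how processContentHandler is written.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration; not the XmlUtil source
class SaxRunSketch {

    // feed the XML text through the supplied handler
    static void run(ContentHandler handler, String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // tiny demonstration handler: record element names in the order they open
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        run(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                names.add(qName);
            }
        }, xml);
        return names;
    }
}
```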
Class SimpleTableContentHandler
A simple ContentHandler that parses data from a HTML table (into a List of Lists).
The example here parses the currency exchange tables from xe.com, and selects a currency using the polar method of G. E. P. Box, M. E. Muller, and G. Marsaglia, as described by Donald E. Knuth in The Art of Computer Programming, Volume 2: Seminumerical Algorithms, section 3.4.1, subsection C, algorithm P.
It then goes on to tell you whether you’d have made a profit or loss by selling that currency 12 months later.
    public static List<List<String>> getCurrencyRates(String date)
        throws MalformedURLException, IOException, SAXException, TransformerException
    {
        // this doesn't actually work due to http://www.xe.com/errors/noautoextract.htm
        // but I'm sure you enterprising types can get around that
        String url = "http://www.xe.com/currencytables/?from=AUD&date=" + date;
        String rubbishHtml = new Scanner(new URL(url).openStream(), "UTF-8").useDelimiter("\\A").next();
        // hint:
        // String rubbishHtml = Text.getFileContents(date.equals("2012-01-01") ? "c:\\rate1.txt" : "c:\\rate2.txt");
        String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
        Document d = XmlUtil.toDocument(cleanXml);
        Element tableEl = (Element) XPathAPI.selectSingleNode(d, ".//table[@id=\"historicalRateTbl\"]");
        XmlUtil.SimpleTableContentHandler stch = new XmlUtil.SimpleTableContentHandler();
        XmlUtil.processContentHandler(stch, XmlUtil.getXmlString(tableEl, false));
        List<List<String>> table = stch.getTable();
        return table;
    }

    // see what amount of abstract currency units you can get in exchange for
    // the shiny Australian cylinder with a kangaroo on it.
    List<List<String>> startTable = getCurrencyRates("2012-01-01");
    List<List<String>> endTable = getCurrencyRates("2013-01-01");

    // pick an investment strategy
    // (nextGaussian() lives on java.util.Random rather than Math.random(), and its result is
    // unbounded and can be negative, so fold it back into a valid row index)
    int tradingOption = (int) Math.abs(new Random().nextGaussian() * startTable.size()) % startTable.size();

    // sanity check that the row with the same index in each table contains the same currency
    if (!startTable.get(tradingOption).get(0).equals(endTable.get(tradingOption).get(0))) {
        throw new RuntimeException("Rate tables don't line up");
    }
    System.out.println("You're investing in the " + startTable.get(tradingOption).get(1));
    System.out.println("Of which you could pick up " + startTable.get(tradingOption).get(2) + " for a measly AUD$1 way back in 2012");
    System.out.println("and then sell for AUD$" + endTable.get(tradingOption).get(3) + " each barely a year later");

    // and then let's pretend that xe.com is accurate, and that rounding,
    // buy/sell prices and transaction costs don't exist
    BigDecimal bd = new BigDecimal(startTable.get(tradingOption).get(2));
    BigDecimal bd2 = new BigDecimal(endTable.get(tradingOption).get(3));
    BigDecimal val = bd.multiply(bd2);
    System.out.println("landing you AUD$" + val);
    System.out.println(val.compareTo(BigDecimal.ONE) > 0 ? "Winner!" : "You lose!");
which generates:
    You're investing in the Belizean Dollar
    Of which you could pick up 2.0266687920 for a measly AUD$1 way back in 2012
    and then sell for AUD$0.4810598095 each barely a year later
    landing you AUD$0.97494890299911512400
    You lose!
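A table-scraping ContentHandler doesn’t need to be complicated. The sketch below shows one way to collect the cells of each row into a List of Lists using a plain SAX DefaultHandler. It is an illustration under my own assumptions (lower or upper case tr/td/th element names, no nested tables, cell text trimmed), the TableHandlerSketch name is invented, and the sample table in the usage note is made up rather than taken from xe.com.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration; not the XmlUtil source
class TableHandlerSketch extends DefaultHandler {
    private final List<List<String>> table = new ArrayList<>();
    private List<String> row;   // row being built, or null when outside a <tr>
    private StringBuilder cell; // cell being built, or null when outside a <td>/<th>

    private static boolean isCell(String name) {
        return name.equalsIgnoreCase("td") || name.equalsIgnoreCase("th");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equalsIgnoreCase("tr")) {
            row = new ArrayList<>();
        } else if (isCell(qName)) {
            cell = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (cell != null) {
            cell.append(ch, start, length); // text of nested inline elements lands here too
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isCell(qName) && cell != null) {
            if (row != null) {
                row.add(cell.toString().trim());
            }
            cell = null;
        } else if (qName.equalsIgnoreCase("tr") && row != null) {
            table.add(row);
            row = null;
        }
    }

    List<List<String>> getTable() {
        return table;
    }

    // convenience: parse a well-formed table fragment and return its rows
    static List<List<String>> parse(String xml) {
        TableHandlerSketch handler = new TableHandlerSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.getTable();
    }
}
```

Feeding it "<table><tr><th>Code</th><th>Rate</th></tr><tr><td>BZD</td><td> 2.02 </td></tr></table>" gives two rows, the header cells and then BZD with 2.02, which is the shape the currency example relies on when it calls get(tradingOption).get(2).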
Class AbstractStackContentHandler
Similar to the Apache Digester, but much much smaller, this ContentHandler creates a ‘stack’ of elements separated by a slash, which it passes through to the concrete implementation.
This example parses a device.xml file (as used in DMX-web, an example of which is attached to the foot of this blog entry), and lists the names of all the devices contained within.

    public static class DeviceContentHandler extends XmlUtil.AbstractStackContentHandler {
        List<String> names = new ArrayList<String>();

        // process the start of an XML element
        public void element(String path) throws SAXException { }

        // process the text of an XML element
        public void elementText(String path, String content) throws SAXException {
            if (path.equals("devices/device/name")) {
                names.add(content);
            }
        }

        // return the names of all devices
        public List<String> getNames() { return names; }
    }

    DeviceContentHandler dch = new DeviceContentHandler();
    XmlUtil.processContentHandler(dch, Text.getFileContents("c:\\device.xml"));
    List<String> names = dch.getNames();
    for (int i=0; i<names.size(); i++) {
        System.out.println(i + ": " + names.get(i));
    }
which generates:
    0: DMXKing USB
    1: Art-Net
    2: WinAMP Controller
    3: Ye olde dmxy plugigne
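The ‘stack of elements separated by a slash’ idea can be sketched with a Deque and a StringBuilder: push the element name on the way in, pop it on the way out, and hand the joined path plus any text to the subclass. This is my own minimal take on the pattern, not the AbstractStackContentHandler source; PathHandlerSketch and DeviceNamesSketch are invented names, and the device XML in the usage note is a guess at the file’s shape based only on the devices/device/name path used above.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration; not the XmlUtil source
abstract class PathHandlerSketch extends DefaultHandler {
    private final Deque<String> stack = new ArrayDeque<>(); // root element first
    private final StringBuilder text = new StringBuilder();

    // called with e.g. "devices/device/name" and the trimmed text of that element
    protected abstract void elementText(String path, String content);

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        stack.addLast(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String content = text.toString().trim();
        if (!content.isEmpty()) {
            elementText(String.join("/", stack), content);
        }
        text.setLength(0);
        stack.removeLast();
    }
}

class DeviceNamesSketch {
    // collect the text of every devices/device/name element
    static List<String> names(String xml) {
        final List<String> names = new ArrayList<>();
        PathHandlerSketch handler = new PathHandlerSketch() {
            @Override
            protected void elementText(String path, String content) {
                if (path.equals("devices/device/name")) {
                    names.add(content);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }
}
```

So DeviceNamesSketch.names("<devices><device><name>Art-Net</name></device></devices>") returns a one-element list containing Art-Net. The sketch only reports text that directly precedes a closing tag, which is fine for data files but would miss the leading text in mixed content.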
Class NodeListIterator
A wrapper for a NodeList that makes it Iterable.
Here’s an example using the WindowTreeDom object from a couple of weeks back.

    // get the Windows window tree in an XML object and attempt to find the Outlook windows in there
    WindowTreeDom wtd = new WindowTreeDom();
    Document windows = wtd.getDom();
    NodeList outlookWindows = XPathAPI.selectNodeList(windows,
        ".//window[@class='rctrl_renwnd32']/window[@class='AfxWndW']/window[@class='#32770']");
    for (Node n : new XmlUtil.NodeListIterator(outlookWindows)) {
        Element e = (Element) n;
        logger.debug("Found possible window " + XmlUtil.getXmlString(n, true));
    }
Which could come in handy to prove that your commercialised electronic marketing campaigns look readable in all three types of email client that exist.
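Since NodeList predates the for-each loop, a wrapper like this only needs to supply an Iterator that counts from zero up to getLength(). A minimal sketch of the idea follows; NodeListIterable and NodeListSketch are names I’ve made up, and this isn’t the XmlUtil.NodeListIterator source.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration; not the XmlUtil source
class NodeListIterable implements Iterable<Node> {
    private final NodeList nodes;

    NodeListIterable(NodeList nodes) {
        this.nodes = nodes;
    }

    @Override
    public Iterator<Node> iterator() {
        return new Iterator<Node>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < nodes.getLength();
            }

            @Override
            public Node next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return nodes.item(index++);
            }
        };
    }
}

class NodeListSketch {
    // names of the root element's direct children, gathered with a for-each loop
    static List<String> childNames(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            List<String> names = new ArrayList<>();
            for (Node n : new NodeListIterable(root.getChildNodes())) {
                names.add(n.getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```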
So I used to have some more code here, but have moved all that to github. Well some of it. The classes are here:
* XmlUtil.java
* XmlUtilTest.java
And here’s the github repos and the maven site docs:
Update 2013-09-25: It’s in central now
Update 2021-01-29: It’s in github now