I hate XML, but now less than before thanks to SimpleXMLParser

I admit it: I hate XML's dancing orgy of angle brackets, even in Java.

Anyway, everything around me is XML-ized. So in 2006 I developed a small XML parser based on SAX. It was shitty, dirty code for JDK 1.4, but it let you parse XML just by defining a method, forgetting about selectors, XPath, X-Wings, TIE fighters and so on…

I called it UltraSmartParser, a shitty name too.
Now I have revived it from the tomb of darkness and dressed it up with fancy superpowers. It is on GitHub: https://github.com/daitangio/SimpleXMLParser

To give you a taste of its power, let's look at this code:

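// Imports assume log4j 1.x for BasicConfigurator and Level.
// Also import SimpleXMLParser from the GitHub project above.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Map;

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class WordPressExportReader extends SimpleXMLParser {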
	public static void main(String[] args) throws SAXException, IOException {
		BasicConfigurator.configure();

		XMLReader sax2Parser = XMLReaderFactory.createXMLReader();
		SimpleXMLParser parser = new WordPressExportReader();
		parser.getLog().setLevel(Level.INFO);
		sax2Parser.setContentHandler(parser);
		File f = new File("c:/jjsoft/gioorgicom.wordpress.2012-08-07.xml");
		FileInputStream is = new FileInputStream(f);
		InputSource s = new InputSource(is);
		sax2Parser.parse(s);
		parser.getLog().info("DONE");
	}

	private String currentTitle,pid;

	public void do_RSS_CHANNEL_ITEM_TITLE(String title) {
		this.currentTitle = title;
	}

	// Catch <wp:post_id>1551</wp:post_id>
	public void do_RSS_CHANNEL_ITEM_POST_ID(String idz){
		pid=idz;
	}

	// Catch stuff like
	// <category domain="category"
	// nicename="software"><![CDATA[Software]]></category>
	// <category domain="series" nicename="version-control"><![CDATA[Version
	// Control]]></category>
	public void do_RSS_CHANNEL_ITEM_CATEGORY(Map catAttribs, String cdata) {
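		// Constant first, so a <category> without a "domain" attribute does not throw a NullPointerException
		if("series".equals(catAttribs.get("domain"))){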
			getLog().info(" POST:"+ pid+":"+ currentTitle+":" + cdata+ ":"+catAttribs.get("nicename"));
		}
	}

}
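
How can a method named do_RSS_CHANNEL_ITEM_TITLE end up being called? Here is a minimal sketch of the idea. It is not the actual SimpleXMLParser source: the class names PathDispatchHandler and TitleCollector are made up for illustration. It assumes the method name is built from the upper-cased element path with any namespace prefix dropped, as the example above suggests. It handles only the single-String handlers, not the Map-of-attributes variant.

```java
import java.io.StringReader;
import java.lang.reflect.Method;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration of name-based dispatch on top of SAX.
class PathDispatchHandler extends DefaultHandler {
    // Names of the currently open elements, root first.
    private final Deque<String> path = new ArrayDeque<>();
    // Text collected so far for each open element.
    private final Deque<StringBuilder> text = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Drop a namespace prefix if present: "wp:post_id" becomes "post_id".
        String name = qName.substring(qName.indexOf(':') + 1);
        path.addLast(name.toUpperCase(Locale.ROOT));
        text.addLast(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!text.isEmpty()) {
            text.peekLast().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // <rss><channel><item><title> maps to do_RSS_CHANNEL_ITEM_TITLE
        String methodName = "do_" + String.join("_", path);
        String content = text.removeLast().toString().trim();
        path.removeLast();
        try {
            Method handler = getClass().getMethod(methodName, String.class);
            handler.invoke(this, content);
        } catch (NoSuchMethodException e) {
            // Nobody is interested in this path: skip it.
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}

// Example subclass: collects the title of every <item>.
class TitleCollector extends PathDispatchHandler {
    final List<String> titles = new ArrayList<>();

    public void do_RSS_CHANNEL_ITEM_TITLE(String title) {
        titles.add(title);
    }

    static List<String> parse(String xml) {
        TitleCollector collector = new TitleCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return collector.titles;
    }
}
```

Calling TitleCollector.parse(...) on a small RSS string returns the item titles in document order. The channel's own title is ignored, because no method matches do_RSS_CHANNEL_TITLE.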

The original code targeted JDK 1.4, so it is a bit “vintage”.
The revamped revision you find on GitHub sports:

  1. Support for attributes, missing in the original version
  2. Optimized algorithm
  3. Hosted on GitHub, to share it with you
  4. Better logging & class/method naming

The first version is called “karmak” because it will be your path to enlightenment…

 

pyparsing review

This is the sad truth: parsing is boring. And writing a parser is even worse.

If you can choose a scripting language for parsing, you might think of doing it in Perl.

If you go this way, take a deep breath and dive into the black sea of Perl's funny regexps. They are funny only if you have a special love for regular expressions.

But if you are more comfortable with Python, pyparsing is a better solution.

Pyparsing is a library written in Python for building parsers from a grammar that you describe in a BNF (Backus-Naur Form) style.

O'Reilly has just published a "Short Cuts" e-book written by Paul McGuire; in fewer than 70 pages you get a very good insight into pyparsing.

Even if you are new to Python, the book is very easy to read.

And if you know nothing about parsers or about Backus and Naur, you will find an easy path to understanding them. Parsing is a tricky topic because of the grammar theory behind it, but for everyday work you can follow McGuire's introduction.

After some simple examples, you'll dive into a small web page parser.

It is quite amazing how you can extract data from web pages without a complex SAX parser, using only a very compact grammar.

After these introductory examples, the book takes us to a more complex task: a parser for S-expressions, a Lisp-like expression language.

This example is important because complex data structures are often recursive, as S-expressions are.
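
To see why the recursion matters, here is a tiny hand-written sketch. It is neither pyparsing nor taken from the book: it is plain Java, and the class name SExprParser is made up for illustration. It assumes a simplified S-expression syntax with only bare atoms and parentheses, no strings and no quoting. The point is that the rule for a list has to call the rule for an expression, which may be a list again.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical recursive-descent reader for a simplified S-expression syntax.
// An expression is either an atom (returned as String) or a list (returned as List<Object>).
class SExprParser {
    private final String src;
    private int pos = 0;

    private SExprParser(String src) {
        this.src = src;
    }

    static Object parse(String src) {
        SExprParser parser = new SExprParser(src);
        Object result = parser.readExpr();
        parser.skipSpaces();
        if (parser.pos != src.length()) {
            throw new IllegalArgumentException("Trailing input at position " + parser.pos);
        }
        return result;
    }

    private Object readExpr() {
        skipSpaces();
        if (pos >= src.length()) {
            throw new IllegalArgumentException("Unexpected end of input");
        }
        char c = src.charAt(pos);
        if (c == ')') {
            throw new IllegalArgumentException("Unexpected ')' at position " + pos);
        }
        if (c != '(') {
            return readAtom();
        }
        pos++; // consume '('
        List<Object> items = new ArrayList<>();
        while (true) {
            skipSpaces();
            if (pos >= src.length()) {
                throw new IllegalArgumentException("Missing ')'");
            }
            if (src.charAt(pos) == ')') {
                pos++; // consume ')'
                return items;
            }
            items.add(readExpr()); // the recursive step: a list contains expressions
        }
    }

    private String readAtom() {
        int start = pos;
        while (pos < src.length()
                && !Character.isWhitespace(src.charAt(pos))
                && src.charAt(pos) != '('
                && src.charAt(pos) != ')') {
            pos++;
        }
        return src.substring(start, pos);
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
            pos++;
        }
    }
}
```

For example, SExprParser.parse("(a (b c) d)") gives a list whose second element is itself a list. No regular expression can match that nesting to arbitrary depth.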

The last chapter, "Search Engine in 100 Lines of Code", is a well-written example that shows us how to build a grammar for a small search engine.

So this e-book is a "must" if you need to do even simple parsing and you… do not want to go crazy with too many regular expressions :)