XML in Python – What If You Need XPath?
One of the lovely things about Python is that there are so many free libraries to choose from. But sometimes that’s a bad thing, because people love to reinvent the wheel, thinking that they can make it somehow rounder and more efficient. This of course results in a lot of dead code and modules that haven’t been updated since the Stone Age.
Recently at work I found myself looking at a new package to replace one really old dead one: PyXML. We’ve got a good chunk of code that reads and writes data to an XML file basically as a flat-file database for when we (or our customers) are not using PostgreSQL, Oracle, etc.
Normally that wouldn’t be such a big deal, except that we have one requirement: XPath.
There are an awful lot of good XML parsers out there in the world. CPython now even comes with one built in: ElementTree. There’s also cElementTree (the compiled version of ElementTree), which has the same API but is a whole lot faster – unless you’re using PyPy, which we’re not – but I digress.
The problem with ElementTree however is that it doesn’t fully support XPath. In fact, it barely does at all. It’d be nice if it did, but it doesn’t much, so there you are.
The following is a run-down of my research into a few other Python libraries that do fully support the XPath standard. I wish I could release the benchmark code that my performance evaluations are based on, but it is built on real use cases of proprietary code. That also means the numbers don’t necessarily represent the on-paper, perfect-world performance of these libraries, but rather more useful real-world use cases. These were run under CPython 2.7.3 on a Windows 7 64-bit Intel Xeon workstation.
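Since I can’t share the real benchmark, here is a minimal sketch of the kind of timing harness I mean. Everything in it is a stand-in: the `SAMPLE` document, the `parse_and_query` workload and the helper names are made up for illustration, and it uses the standard library’s ElementTree rather than any of our actual code.

```python
import timeit
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real benchmark used proprietary data.
SAMPLE = "<rows>" + "".join(
    '<row id="%d"><v>%d</v></row>' % (i, i * 2) for i in range(200)
) + "</rows>"


def parse_and_query():
    """Stand-in workload: parse the document and run one lookup."""
    root = ET.fromstring(SAMPLE)
    return len(root.findall("row[@id='42']"))


def best_time(func, repeat=3, number=20):
    """Best-of-N average seconds per call, to damp out machine noise."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

To compare two libraries you would write one workload function per library, time each with `best_time`, and feed the two numbers to `speedup` with the PyXML time as the baseline.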
You’d think that PyXML, having last been updated in 2004, would be long dead. And it is. But if you don’t mind fixing some minor things in this all-Python XML library, it actually does still work. You just have to find-and-replace the two places where “as” is used as a variable name, since Python now treats “as” as a reserved keyword (as of Python 2.6). Simply changing the name to “_as” is enough to fix the problem, and then you can continue to use the long-dead PyXML on Python 2.7.3. Since this is the library that our code used in the past for Python XPath XML support, it’s what I used as a baseline to compare other libraries to. We also used Python’s minidom with PyXML, which is not exactly known for speed…
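If you want to see why the rename is needed, a tiny check like the following shows it. This is only an illustration of the keyword problem, not the actual PyXML source; the `compiles` helper is a name I made up.

```python
import keyword


def compiles(source):
    """Return True if the snippet is syntactically valid Python."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


print(keyword.iskeyword("as"))  # True on Python 2.6 and later
print(compiles("as = 1"))       # False: "as" can no longer be a variable name
print(compiles("_as = 1"))      # True: the renamed variable is fine
```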
4Suite is another dead project, having last been updated in 2006 as far as I can tell. It’s written by the same company that wrote most of what is in PyXML. (According to Wikipedia, it’s also the same company that brought you PowerPoint?) Unfortunately their corporate website seems to be down, meaning that they’re likely just as dead as their 4Suite package. Fortunately the great thing about places like SourceForge, besides the whole open-source thing, is that they’re also a great repository for dead packages and code.
The advantage of 4Suite is that because it was written by the same people as PyXML, it can be used with minimal code changes. It contains the PyXML API with very few differences. It just adds a whole lot more. The one big difference is that you don’t use Python’s minidom, you use 4Suite’s cDomlette. Cute. And yes, it’s compiled code. And at least on CPython, it runs faster. It’s a little more than twice as fast as PyXML.
Finally, a living, breathing project! Based on a Python wrapping of the Gnome XML parser written in C, libxml2 is a breath of fresh air. Err … sort of. There’s a nice object-oriented wrapper written in Python. Which would be good … if it were documented. But darned if I can find any API documentation. And since the libxml2 wrapper changes the API dramatically from the original C code that it came from, it takes a bit of figuring out to use. It’s also slow. Oh, sure, two and a half times the performance of PyXML seems great. It’s even better than 4Suite. Barely. But if you read on, you’ll see it’s not so impressive after all, and there’s actually a very useful alternative right under its nose.
Yes, that’s right, this next one (libxml2mod) is still technically the same libxml2 module as above. But if you load the libxml2mod.pyd file directly and skip the object-oriented Pythonic wrapper, going straight to the literal Gnome libxml2 APIs, you’ll have a lot more programming work (the API takes a lot more effort to code to) in exchange for much better performance: six times the speed of PyXML. And it fully supports XPath. Who could ask for more?
Well, I could, actually. I don’t know if it’s the distribution that I got, or if it’s just not fully wrapped, or what, but there were some pieces of the Gnome C code’s API missing from the Python libxml2mod.pyd file. The largest omission, to me, was that XPath’s compile operation was completely missing. Since compiling an expression once can be vital to performance when you evaluate the same query across multiple nodes, it makes the 6X speed improvement even more impressive, as I was forced to do things the slow way, without compile. Which can of course be done. But it just makes you wonder, because the Gnome library definitely has this API, so it’s a mystery why the libxml2mod.pyd file didn’t.
If you’re using Qt4 as your Python GUI, you might as well use the Qt4 XML parser … right?
Well, maybe not.
Now don’t get me wrong. I love Qt.
Or at least, I loved the Qt that Trolltech put out.
But ever since Nokia bought Qt, it’s gone downhill. Fast. And this is a perfect example, right here.
The Qt4 XML parser is the most needlessly complicated pile of API I’ve ever run across. Oh, it’s highly flexible. In theory. And it fully supports XPath. … In theory. (I certainly haven’t tested every last feature.) But darned if I didn’t run into all sorts of mess just trying to convert the benchmark code to use Qt4’s XML parser. It was even worse when a bug (I don’t know if it’s in Qt4 or PyQt4) prevented me from evaluating to a QString, like you’re supposed to be able to do. So simple property lookups required the full QXmlResultItems overkill, where I resolve the first result item from my results class instance, get the model index from that item, then use the model pointer from the index, together with the index, to resolve it into a string. Instead of just getting the first string, like I’d wanted and like it should have been able to do. And not only is the API a mess (a highly flexible mess, but still a mess all the same), but it’s also two and a half times slower than PyXML. I honestly didn’t think it would even be possible to write an XML parser for CPython that was slower than a pure-Python implementation, one that uses a DOM no less. Surely the C++ compiled-code PyQt4 would have a much faster XML parser than PyXML, right?
Well, apparently not! As my benchmarks showed.
It was slow. Really slow.
Three-legged horse at the racetrack slow!
So I would highly suggest, to anyone using Qt4, DO NOT USE QT4’s XML PARSER! It’s that bad. To code for, and in performance. Find yourself another library for your XML needs. Trust me, you’ll be much happier that way.
I can only hope now that Digia owns Qt that some of these horrendous trainwrecks that have plagued Qt4 can finally be sorted out over time. Not likely to be seen in Qt5 though, as that’s still Nokia’s aborted afterbirth. Digia probably won’t get things straightened out until Qt6. And goodness knows how many years away that could end up being!
It’s hard to believe that with these lovely landmines in Qt that I still love it. But the thing is, as bad as some parts of Qt are, no one has ever come close to doing anything better as an all-around solution to platform independent computer programming. I just wish the original integrity of Trolltech had even remotely carried on to Nokia. I just hope that Digia can give back some of the polish that Qt once had.
So I know, I already said that ElementTree doesn’t really support XPath properly yet. I really wish that it did. It’d be nice if I could just use the libraries built into Python for everything, and a good XML parser seems like a no-brainer. But for whatever reason, XPath is not really a part of ElementTree. They have added some beginning support for XPath-style query strings to ElementTree’s find/findall methods, but a full implementation of XPath it is not. It doesn’t even support the full XPath expression syntax there.
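To give a feel for what that partial support looks like, here is a small sketch against a made-up document (the `library`/`book` data is purely illustrative). It only uses query forms that ElementTree 1.3 documents as supported: relative paths, `.//` descendant search, attribute predicates and positional predicates.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not from our real files.
XML = """<library>
  <book id="b1" lang="en"><title>Alpha</title></book>
  <book id="b2" lang="de"><title>Beta</title></book>
  <book id="b3" lang="en"><title>Gamma</title></book>
</library>"""

root = ET.fromstring(XML)

# Descendant search combined with an attribute predicate.
english_titles = [
    book.findtext("title") for book in root.findall(".//book[@lang='en']")
]

# Positional predicate (1-based, as in XPath).
second_id = root.find("book[2]").get("id")

print(english_titles)  # ['Alpha', 'Gamma']
print(second_id)       # b2
```

What you don’t get are things like XPath functions or numeric comparisons inside predicates. Those have to be handled in Python code around the query.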
Still, at least for enough of our use case, I was able to code for ElementTree. Converting the code from a full on XPath PyXML implementation to ElementTree and its lame partial implementation of XPath-based queries wasn’t as much work as it could have been. It’s nowhere near as much work as, say, converting the code to PyQt4, or even to libxml2. Which was pleasantly surprising. It’s a nice simple API, so I can see why people love it.
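As an illustration of the kind of conversion involved (this is not our production code, and the `titles_over` helper and sample data are invented for the example), a full XPath query such as `//book[price > 10]/title/text()` has no direct ElementTree equivalent, but the filtering part can be moved into ordinary Python:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data for the sketch.
XML = """<library>
  <book><title>Alpha</title><price>8.50</price></book>
  <book><title>Beta</title><price>12.00</price></book>
  <book><title>Gamma</title><price>20.00</price></book>
</library>"""


def titles_over(root, limit):
    """Rough stand-in for the XPath //book[price > limit]/title/text()."""
    return [
        book.findtext("title")
        for book in root.iter("book")
        # The numeric comparison happens in Python, not in the query string.
        if float(book.findtext("price", "0")) > limit
    ]


root = ET.fromstring(XML)
print(titles_over(root, 10))  # ['Beta', 'Gamma']
```

Whether this trade works for you depends on how exotic your XPath expressions are. Ours were simple enough that most of them converted along these lines.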
And the performance? It’s about three times faster than PyXML, making it a fair improvement. For a pure-Python implementation it’s actually quite amazing to squeeze that much out. Then again, there’s a reason people don’t use DOM anymore. The real treat comes next.
And here, with cElementTree, we have a real winner! Also included in Python, it has the same API as ElementTree, just in a very well written compiled-code implementation wrapped for Python. The same code that ran my ElementTree port also ran on cElementTree with only the library name changing. Exactly like it should.
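The “only the library name changing” part usually comes down to the import line. A common way to write it, sketched here with a throwaway document, is to try the compiled module first and fall back to the pure-Python one. The fallback also covers interpreters that don’t ship a separate cElementTree module.

```python
try:
    # Compiled implementation, where the interpreter provides it.
    import xml.etree.cElementTree as ET
except ImportError:
    # Same API, pure-Python (or already accelerated) implementation.
    import xml.etree.ElementTree as ET

# Everything below is identical regardless of which import succeeded.
root = ET.fromstring('<config><item name="a">1</item><item name="b">2</item></config>')
value = root.find("item[@name='b']").text
print(value)  # 2
```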
And the results were astounding. The real-use-case benchmark of XML parsing was a whopping eighteen times faster than our old PyXML code. Ding-dong, the DOM is dead!
Of course the problem is, all of our existing code is written for DOM using PyXML, so it’ll take a while to convert all of that to cElementTree.
As a side note, if any PyPy enthusiasts want to know why CPython programmers can’t convert to PyPy just yet (maybe not ever), here’s the reason why. A well-wrapped compiled-code library runs like a champ in CPython. As a result, a lot of us big data/number crunchers have lots of compiled code in our Python projects. And since PyPy only just barely runs compiled extensions, and runs them far slower than it runs native Python code, this leaves a lot of us CPython folks out in the cold. If you want the serious data crunchers to switch to PyPy then you have to start taking that compiled-code lag more seriously, or else we’re never going to be able to join you in your fancy little JIT Python interpreter’s dance.
Conclusion
So if you’re a Python 2.7 programmer looking for the best XPath XML parser ever, well, if you’re staying true to XPath at least, I’d say go with libxml2.
However, if you can swing it (and you’ll need to really evaluate your code to determine this) you might be able to get away with the rather unfinished XPath implementation in cElementTree. In which case you won’t need to install any third-party package for XML parsing and you’ll get blinding performance out the asterisk. (And obviously, if you’re coming at a whole new XML parser, and you don’t need XPath at all, then go with cElementTree since it’s what everyone in Python land is using and it’s got great performance.)
Hopefully the all-Python ElementTree runs just as well on PyPy, giving the world a pretty well-rounded solution.
If any ElementTree authors catch this, hey, could you please work on supporting XPath a little more seriously?
And finally, dear god of all things software, whatever you do, avoid Qt4’s XML parser like the plague! Unfortunately I can’t speak to Qt5 yet as there’s still a lot of untested theory there that, professionally, we just don’t want to even approach mucking about with yet. Let things get a few more minor version numbers under the hood and then we can re-evaluate a PyQt5 upgrade path. (Or maybe even PySide.) But even if Qt’s XML parsing gets a major performance improvement, the API is still just as likely to suck wet donkey fur for being so “flexible”. Seriously, what committee designed that API? It’s everything that you could ever need … without being anything that you’d ever want! Yeesh!

