The first in an occasional series
Intro: A while back I wrote a script to analyze an iTunes library and find your favorite artists, albums, etc. It works pretty well and I regularly use it to update my own analysis. Unfortunately, it generally takes a long time to run, which is sort of OK for me (because I just start it running and go do something else) but less good for people who are running the analysis through the web site.
So I’d like to make it run faster, and I have a number of ideas for how to do so.
Idea: There are two main parts to the system – parsing the iTunes Music Library.xml file into a database, and running the analysis on the database. First I’m focusing on the parsing part.
After profiling the first version by running python -m cProfile -o profiledata.oldway iTunesInfo.py "iTunes Music Library.xml", I see that the whole parsing process takes 114 seconds. The major parts of this are 60 seconds for the xml.dom.minidom.parse method and 46 seconds for the database operations. Note that this only leaves ~8 seconds for figuring out the track information – clearly this is not the bottleneck!
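To see where the time goes, the dump that cProfile writes can be loaded with the standard-library pstats module. Here’s a self-contained sketch of that workflow; the work function and the profiledata.tmp filename are stand-ins for illustration, not the real script:

```python
import cProfile
import io
import pstats

def work():
    # Stand-in for the parsing workload being profiled.
    return sum(i * i for i in range(100000))

# Profile to a file, like running: python -m cProfile -o profiledata.tmp ...
cProfile.runctx("work()", {"work": work}, {}, "profiledata.tmp")

# Load the dump and print the most expensive calls by cumulative time.
stream = io.StringIO()
stats = pstats.Stats("profiledata.tmp", stream=stream)
stats.sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

Sorting by cumulative time is what surfaces the big buckets (parsing versus database work) rather than individual hot inner functions.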
So I’d like to improve parsing speed. There are two basic kinds of XML parsers. What we’re using now is a DOM (Document Object Model) parser, which reads the entire file and returns a parsed tree containing all the data. (I remember writing a simple XML parser that did this as a project in COMP 314. Ah, memories…) The advantage of this approach is that once parsing is done, it’s easy to traverse the DOM tree and find the data you’re interested in. The downside is that, well, it’s slow. The entire document has to be read into memory, so memory usage is proportional to the size of the file you’re processing, which adds to the slowness and can lead to out-of-memory problems on huge files (although we weren’t seeing that here).
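For illustration, here’s a minimal DOM-style sketch using the stdlib xml.dom.minidom on a tiny plist-like snippet (the XML below is a made-up fragment, not the real library file):

```python
from xml.dom import minidom

xml_text = """<plist>
  <dict>
    <key>Name</key><string>Some Track</string>
    <key>Artist</key><string>Some Artist</string>
  </dict>
</plist>"""

# The whole document is parsed into an in-memory tree up front...
doc = minidom.parseString(xml_text)

# ...and afterwards we can freely query it for whatever we want.
keys = [node.firstChild.data for node in doc.getElementsByTagName("key")]
print(keys)  # → ['Name', 'Artist']
```

Convenient to query, but the tree for a multi-megabyte library file is what costs all that time and memory.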
The other basic kind of XML parser is known as SAX, or Simple API for XML. You provide callback functions that are called whenever the parser runs across the start of a tag, end of a tag, character data, and…that’s it. Whatever processing you want to do you have to do in those callback functions. So if you’re just, say, counting the number of <key> tags in a document this works really well. It’s also much faster than the DOM-style parser, since it doesn’t have to generate a giant tree structure. But doing the sorts of processing we’re doing on the library file seems a bit more tricky.
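That key-counting case looks roughly like this with the stdlib xml.sax module (a minimal sketch, not code from the actual script):

```python
import xml.sax

class KeyCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once for every opening tag as the parser streams the file;
        # no tree is ever built.
        if name == "key":
            self.count += 1

handler = KeyCounter()
xml.sax.parseString(b"<dict><key>Name</key><key>Artist</key></dict>", handler)
print(handler.count)  # → 2
```

Nothing is retained between callbacks unless the handler stores it itself, which is exactly why anything more stateful than counting gets tricky.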
Anyway, I take a stab at it, and after a while end up with version 2 of the script. Notice that the logic in the Handler class is a bit twisted: we have to keep track of where we are in the document (so if things arrive out of order we’ll have problems), using a state-based system that is brittle and unclear.
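To give a flavor of the state tracking involved, here’s a hedged sketch of that kind of handler (not the actual version-2 code): iTunes-style plist XML alternates <key> tags with value tags, so the handler remembers which element it’s inside and the last key it saw, then pairs them up:

```python
import xml.sax

class TrackHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_key = False      # currently inside a <key> element?
        self.in_value = False    # currently inside a value element?
        self.current_key = None  # last key text we saw
        self.track = {}          # accumulated key/value pairs

    def startElement(self, name, attrs):
        if name == "key":
            self.in_key = True
        elif name in ("string", "integer", "date"):
            self.in_value = True

    def characters(self, content):
        # Note: characters() may fire with partial text on large inputs;
        # a production handler would buffer and join the pieces.
        if self.in_key:
            self.current_key = content
        elif self.in_value:
            self.track[self.current_key] = content

    def endElement(self, name):
        self.in_key = False
        self.in_value = False

handler = TrackHandler()
xml.sax.parseString(
    b"<dict><key>Name</key><string>Some Track</string>"
    b"<key>Artist</key><string>Some Artist</string></dict>",
    handler,
)
print(handler.track)  # → {'Name': 'Some Track', 'Artist': 'Some Artist'}
```

The brittleness is visible even in this toy: if the document’s key/value ordering ever changed, the pairing logic would silently produce garbage.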
But how does it perform? The old version of the script ran in 114 seconds, and this version runs in 71 seconds, a ~60% increase in speed. But really, it’s better than that, because the database operations still take around 50 seconds: if we subtract that from both runs we get 64 seconds versus 21 seconds, a ~200% increase in the speed of the parsing itself.
Conclusion: This was a big success! Most of the time is now in the database layer, which I have some ideas for speeding up next time.