Tue
15
Mar '05
|
by Frank Spychalski filed under Work
|
This week’s topic is “test before you start working”.
One of the projects I’m working on is a tool which reads a pretty big XML input file (3.2 MB). I was looking for ways to speed up the whole thing and noticed that parts of the file look like this:
<host ... > <ipv4>ip</ipv4> <ipv4>ip</ipv4> <ipv4>ip</ipv4> … </host>
My idea was to move the list of IPs into a comma-separated list in an attribute. Five minutes and an XSLT later I had the config file in this new format, shrunk from 3.2 MB to 2.2 MB. Another five minutes later I had a short Java program which parses both files and measures the time in both cases. Surprising result: very little difference. The new file parses on average less than 10% faster than the old format, with great variation (as much as 100%) in the time needed. So I guess I have to search somewhere else for ways to speed the whole thing up, but at least I didn’t waste much time.
edit for more info
the results using a SAXParser (the tool is using a SAXParser now):
new 2771ms old 2546ms diff 225ms
new 999ms old 2887ms diff -1888ms
new 1381ms old 1433ms diff -52ms
new 1333ms old 1357ms diff -24ms
new 1381ms old 1426ms diff -45ms
new 2578ms old 2660ms diff -82ms
new 1387ms old 1554ms diff -167ms
new 2334ms old 2905ms diff -571ms
new 2105ms old 5054ms diff -2949ms
new 1308ms old 4739ms diff -3431ms
new 1999ms old 1440ms diff 559ms
new 1389ms old 1513ms diff -124ms
new 2974ms old 1816ms diff 1158ms
new 1468ms old 1462ms diff 6ms
new 1423ms old 1494ms diff -71ms
new 1424ms old 1493ms diff -69ms
new 1332ms old 1348ms diff -16ms
new 1350ms old 2065ms diff -715ms
new 1399ms old 1468ms diff -69ms
new 932ms old 1960ms diff -1028ms
and DOMParser
new 2304ms old 3886ms diff -1582ms
new 1598ms old 4438ms diff -2840ms
new 1581ms old 4470ms diff -2889ms
new 1563ms old 4499ms diff -2936ms
new 1595ms old 4511ms diff -2916ms
new 1607ms old 4453ms diff -2846ms
new 1556ms old 4412ms diff -2856ms
new 1576ms old 4421ms diff -2845ms
new 1552ms old 4422ms diff -2870ms
new 1567ms old 4430ms diff -2863ms
new 1616ms old 4437ms diff -2821ms
new 1576ms old 4441ms diff -2865ms
new 1569ms old 4491ms diff -2922ms
new 1562ms old 4416ms diff -2854ms
new 1595ms old 4497ms diff -2902ms
new 1566ms old 4448ms diff -2882ms
new 1561ms old 4516ms diff -2955ms
new 1655ms old 4419ms diff -2764ms
new 1576ms old 4515ms diff -2939ms
new 1668ms old 4753ms diff -3085ms
the time measured was just for parsing the document, no processing was done.
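For anyone who wants to try something similar, here is a minimal sketch of such a timing test. It is not the original program: the class name ParseTimer, the generated sample data, the host and IP counts, and the attribute name "ips" are all assumptions made for illustration, and the documents are built in memory instead of being read from the real config files. Like the measurements above, it times only the parse itself and does no processing.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ParseTimer {

    // Old format: one <ipv4> child element per address.
    static byte[] oldFormat(int hosts, int ipsPerHost) {
        StringBuilder sb = new StringBuilder("<hosts>");
        for (int h = 0; h < hosts; h++) {
            sb.append("<host>");
            for (int i = 0; i < ipsPerHost; i++) {
                sb.append("<ipv4>10.0.").append(h % 256).append('.')
                  .append(i % 256).append("</ipv4>");
            }
            sb.append("</host>");
        }
        return sb.append("</hosts>").toString().getBytes(StandardCharsets.UTF_8);
    }

    // New format: all addresses of a host in one comma-separated attribute.
    // The attribute name "ips" is hypothetical.
    static byte[] newFormat(int hosts, int ipsPerHost) {
        StringBuilder sb = new StringBuilder("<hosts>");
        for (int h = 0; h < hosts; h++) {
            sb.append("<host ips=\"");
            for (int i = 0; i < ipsPerHost; i++) {
                if (i > 0) sb.append(',');
                sb.append("10.0.").append(h % 256).append('.').append(i % 256);
            }
            sb.append("\"/>");
        }
        return sb.append("</hosts>").toString().getBytes(StandardCharsets.UTF_8);
    }

    // Milliseconds for one SAX parse with a handler that ignores every event.
    static long timeSax(byte[] xml) throws Exception {
        long start = System.nanoTime();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), new DefaultHandler());
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Milliseconds for one DOM parse; the resulting tree is discarded.
    static long timeDom(byte[] xml) throws Exception {
        long start = System.nanoTime();
        DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        byte[] oldXml = oldFormat(2000, 50);
        byte[] newXml = newFormat(2000, 50);
        // Several runs, because single measurements vary a lot.
        for (int run = 0; run < 20; run++) {
            long n = timeSax(newXml);
            long o = timeSax(oldXml);
            System.out.println("new " + n + "ms old " + o + "ms diff " + (n - o) + "ms");
        }
    }
}
```

Swapping timeSax for timeDom in main gives the DOM comparison. Because the input is already in memory, this sketch leaves out any disk IO, so its numbers are not directly comparable to the ones listed above.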
This is in fact an interesting result. I would expect that during the parsing process, IO takes the biggest amount of time, and therefore, the time for parsing should be directly dependent on the file size. On the other hand the big variation sounds like an IO issue.
Do you use a SAX or a DOM parser? And what exactly did you measure, only the parsing or including some processing?
by now I have spent more time writing this post than actually testing the whole thing - I should get back to work
From your results, I would assume that the interesting point is the usage of the caches (disk or memory cache): the first result differs in both cases significantly from all other results.
The SAX version most likely runs completely out of the (disk) cache. That explains why the file size does not really matter. The DOM version with the new file shows the same behaviour, which explains why there is not much difference in the execution time compared to the SAX version. The DOM version on the old file seems to break the cache usage (most likely because of the bigger memory footprint due to the larger number of nodes in the tree).
sorry Uli, but you missed the point of my post completely. If I had used a DOMParser, these results would have shown me an easy way to speed up the whole thing. But as I’m already using the SAXParser, changing the file format would yield very little gain. The 10-15 min spent in ‘testing’ saved me from wasting much time changing my code and Bodo (the Mathe Bodo(tm)) from changing his. As long as something is easy to test, write a simple test first before diving head-first into changing (and possibly breaking) existing code…