Tue 15 Mar '05
test first again
by Frank Spychalski filed under Work

This week’s topic is “test before you start working” :-)

One of the projects I’m working on is a tool which reads a pretty big XML input file (3.2 MB). I was looking for ways to speed up the whole thing and noticed that parts of the file look like this:

<host ... >
  <ipv4>ip</ipv4>
  <ipv4>ip</ipv4>
  <ipv4>ip</ipv4>
        …
</host>

My idea was to move the list of IPs into a comma-separated list in an attribute. Five minutes and an XSLT later I had the config file in this new format, shrunk from 3.2 MB to 2.2 MB. Another five minutes later I had a short Java program which parses both files and measures the time in both cases. Surprising result: very little difference. The new file parses on average less than 10% faster than the old format, and there is great variation (as much as 100%) in the time needed. So I guess I have to search somewhere else for ways to speed the whole thing up, but at least I didn’t waste much time.
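The test program itself is not shown here, so the following is only a minimal sketch of that kind of measurement, not the original code. It assumes a hypothetical root element `hosts`, a hypothetical `name` attribute on `host`, and `ipv4` as the name of the new attribute. It also builds both documents in memory instead of reading the real files, so disk IO is not part of the measured time.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class ParseTiming {

    // Made-up address for host h, entry i (illustration only).
    static String ip(int h, int i) {
        return "10.0." + (h % 256) + "." + (i % 256);
    }

    // Old format: one <ipv4> child element per address.
    static String oldFormat(int hosts, int ipsPerHost) {
        StringBuilder sb = new StringBuilder("<hosts>");
        for (int h = 0; h < hosts; h++) {
            sb.append("<host name=\"h").append(h).append("\">");
            for (int i = 0; i < ipsPerHost; i++) {
                sb.append("<ipv4>").append(ip(h, i)).append("</ipv4>");
            }
            sb.append("</host>");
        }
        return sb.append("</hosts>").toString();
    }

    // New format: all addresses as a comma-separated list in one attribute.
    static String newFormat(int hosts, int ipsPerHost) {
        StringBuilder sb = new StringBuilder("<hosts>");
        for (int h = 0; h < hosts; h++) {
            sb.append("<host name=\"h").append(h).append("\" ipv4=\"");
            for (int i = 0; i < ipsPerHost; i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(ip(h, i));
            }
            sb.append("\"/>");
        }
        return sb.append("</hosts>").toString();
    }

    // Parse with SAX and an empty handler: parsing only, no processing.
    static long parseMillis(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            long start = System.nanoTime();
            parser.parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            return (System.nanoTime() - start) / 1_000_000;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String oldXml = oldFormat(2000, 20);
        String newXml = newFormat(2000, 20);
        // Repeat the measurement; single runs vary a lot.
        for (int run = 0; run < 20; run++) {
            long n = parseMillis(newXml);
            long o = parseMillis(oldXml);
            System.out.println("new " + n + "ms old " + o + "ms diff " + (n - o) + "ms");
        }
    }
}
```

Running the pair several times in a loop matters more than the exact numbers, because a single run can easily be off by a factor of two.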

edit for more info

the results using a SAXParser (the tool is using a SAXParser now):

new 2771ms old 2546ms diff 225ms
new 999ms old 2887ms diff -1888ms
new 1381ms old 1433ms diff -52ms
new 1333ms old 1357ms diff -24ms
new 1381ms old 1426ms diff -45ms
new 2578ms old 2660ms diff -82ms
new 1387ms old 1554ms diff -167ms
new 2334ms old 2905ms diff -571ms
new 2105ms old 5054ms diff -2949ms
new 1308ms old 4739ms diff -3431ms
new 1999ms old 1440ms diff 559ms
new 1389ms old 1513ms diff -124ms
new 2974ms old 1816ms diff 1158ms
new 1468ms old 1462ms diff 6ms
new 1423ms old 1494ms diff -71ms
new 1424ms old 1493ms diff -69ms
new 1332ms old 1348ms diff -16ms
new 1350ms old 2065ms diff -715ms
new 1399ms old 1468ms diff -69ms
new 932ms old 1960ms diff -1028ms

and using a DOMParser:

new 2304ms old 3886ms diff -1582ms
new 1598ms old 4438ms diff -2840ms
new 1581ms old 4470ms diff -2889ms
new 1563ms old 4499ms diff -2936ms
new 1595ms old 4511ms diff -2916ms
new 1607ms old 4453ms diff -2846ms
new 1556ms old 4412ms diff -2856ms
new 1576ms old 4421ms diff -2845ms
new 1552ms old 4422ms diff -2870ms
new 1567ms old 4430ms diff -2863ms
new 1616ms old 4437ms diff -2821ms
new 1576ms old 4441ms diff -2865ms
new 1569ms old 4491ms diff -2922ms
new 1562ms old 4416ms diff -2854ms
new 1595ms old 4497ms diff -2902ms
new 1566ms old 4448ms diff -2882ms
new 1561ms old 4516ms diff -2955ms
new 1655ms old 4419ms diff -2764ms
new 1576ms old 4515ms diff -2939ms
new 1668ms old 4753ms diff -3085ms

The time measured was just for parsing the document; no processing was done.
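For the DOM case, one thing that is easy to check is how many nodes each format produces, since a DOM parser has to build an object for every element and text node. The sketch below is an illustration under assumptions, not the measurement code used above: it uses the standard JAXP `DocumentBuilder` and two tiny hand-written documents with made-up addresses. Whether the node count really explains the timings is a guess.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class DomNodeCount {

    // Parse the bytes into a DOM tree.
    static Document parse(byte[] xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Count this node plus all descendants.
    // Attributes are not child nodes in DOM, so they are not counted.
    static int countNodes(Node node) {
        int count = 1;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            count += countNodes(c);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] oldXml = "<host><ipv4>10.0.0.1</ipv4><ipv4>10.0.0.2</ipv4></host>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] newXml = "<host ipv4=\"10.0.0.1,10.0.0.2\"/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("old nodes: " + countNodes(parse(oldXml)));
        System.out.println("new nodes: " + countNodes(parse(newXml)));
    }
}
```

In the old format every address costs an element node plus a text node, while in the new format all addresses of a host share one attribute.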


4 Responses to “test first again”

  1.

    This is in fact an interesting result. I would expect that during the parsing process, IO takes the biggest amount of time, and therefore, the time for parsing should be directly dependent on the file size. On the other hand the big variation sounds like an IO issue.

    Do you use a SAX or a DOM parser? And what exactly did you measure, only the parsing or including some processing?

    Uli (March 15th, 2005 at 15:46)
  2.

    By now I have spent more time writing this post than actually testing the whole thing - I should get back to work :-)

    Frank Spychalski (March 15th, 2005 at 16:11)
  3.

    From your results, I would assume that the interesting point is the usage of the caches (disk or memory cache): in both cases the first result differs significantly from all other results.
    The SAX version most likely runs completely out of the (disk) cache. That explains why the file size does not really matter. The DOM version with the new file shows the same behaviour, which explains why there is not much difference in the execution time compared to the SAX version. The DOM version on the old file seems to break the cache usage (most likely because of the bigger memory footprint due to the bigger number of nodes in the tree).

    Uli (March 15th, 2005 at 17:40)
  4.

    Sorry Uli, but you missed the point of my post completely. If I had used a DOMParser, these results would have shown me an easy way to speed up the whole thing. But as I’m already using the SAXParser, changing the file format would yield very little gain. The 10-15 min spent on ‘testing’ saved me from wasting much time changing my code, and saved Bodo (the Mathe Bodo(tm)) from changing his. As long as something is easy to test, write a simple test first before diving head-first into changing (and possibly breaking) existing code…

    Frank Spychalski (March 15th, 2005 at 17:54)
