A very simple study comparing the performance of three open-source XPath engines.
Based on a performance study by Martin Böhm and Jean-Jacques Dubray, click here for more details.
Purpose
This study compares the time and memory complexity of four different XPath configurations, and how performance varies with an increasing XML document size.
Participating XPath Engines
Name | Configuration | XML Parser |
TagBox | TagBox XPath Engine v0.9.1 | Xerces 2.0.0 |
Jaxen | Jaxen dom4j XPath Engine | dom4j |
Xalan 2.0.1 | Xalan-Java XSLT Processor | Xerces 1.2.3 |
Xalan 2.3.1 | Xalan-Java XSLT Processor | Xerces 2.0.0 |
The Test
Two XPath expressions were evaluated by the four different configurations and the first node of the result set from a varying sized document was retrieved.
The expressions ran against a simple three-level document that was created in memory. The document had 'n' level one elements (<EDB-ART>), where 'n' varied from 1 to 10000 in 1000 increments. The level one elements in turn had 20 child elements, each with a name unique in the document.
Example:
<?xml version="1.0" encoding="UTF-8"?> <plm_result> <EDB-ART> <T_MASTER_DAT.PART_ID0x0>123456789</T_MASTER_DAT.PART_ID0x0> <T_MASTER_DAT.PART_ID1x0>123456789</T_MASTER_DAT.PART_ID1x0> <T_MASTER_DAT.PART_ID2x0>123456789</T_MASTER_DAT.PART_ID2x0> <T_MASTER_DAT.PART_ID3x0>123456789</T_MASTER_DAT.PART_ID3x0> <T_MASTER_DAT.PART_ID4x0>123456789</T_MASTER_DAT.PART_ID4x0> <T_MASTER_DAT.PART_ID5x0>123456789</T_MASTER_DAT.PART_ID5x0> <T_MASTER_DAT.PART_ID6x0>123456789</T_MASTER_DAT.PART_ID6x0> <T_MASTER_DAT.PART_ID7x0>123456789</T_MASTER_DAT.PART_ID7x0> <T_MASTER_DAT.PART_ID8x0>123456789</T_MASTER_DAT.PART_ID8x0> <T_MASTER_DAT.PART_ID9x0>123456789</T_MASTER_DAT.PART_ID9x0> <T_MASTER_DAT.PART_ID10x0>123456789</T_MASTER_DAT.PART_ID10x0> <T_MASTER_DAT.PART_ID11x0>123456789</T_MASTER_DAT.PART_ID11x0> <T_MASTER_DAT.PART_ID12x0>123456789</T_MASTER_DAT.PART_ID12x0> <T_MASTER_DAT.PART_ID13x0>123456789</T_MASTER_DAT.PART_ID13x0> <T_MASTER_DAT.PART_ID14x0>123456789</T_MASTER_DAT.PART_ID14x0> <T_MASTER_DAT.PART_ID15x0>123456789</T_MASTER_DAT.PART_ID15x0> <T_MASTER_DAT.PART_ID16x0>123456789</T_MASTER_DAT.PART_ID16x0> <T_MASTER_DAT.PART_ID17x0>123456789</T_MASTER_DAT.PART_ID17x0> <T_MASTER_DAT.PART_ID18x0>123456789</T_MASTER_DAT.PART_ID18x0> <T_MASTER_DAT.PART_ID19x0>123456789</T_MASTER_DAT.PART_ID19x0> </EDB-ART> </plm_result>
The first expression indexed the first level two element on the descendant-or-self axis:
//T_MASTER_DAT.PART_ID0x0
The second expression indexed the last of all of the level two elements, eg for a document with 1000 level one elements:
//T_MASTER_DAT.PART_ID19x999
Time was measured as the difference just before the XPath expression was evaluated and just after a single element had been retrieved from the result set.
Memory was measured as the difference in total memory used by the JVM before the document was created and after the first element had been retrieved from the XPath result set.
The tests were executed on a single occasion so there has been no averaging of results.
All four configurations were tested using the simplest API provided, with no special caching of parsed expressions or results.
Test units consisted of evaluating a single XPath expression 100 times with one XPath engine.
All tests ran 'cold', ie for each test unit a separate Java process was started with no pre-runs to condition the JVM.
The Platform
The test machine was a Linux 2.4.18 PC with a single AMD Athlon 1700+ processor and 512Mb RAM, running jdk1.3.1:
mars@spock:~> java -version java version "1.3.1" Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1-b24) Java HotSpot(TM) Client VM (build 1.3.1-b24, mixed mode) mars@spock:~> uname -a Linux spock 2.4.18 #2 Sun Mar 17 17:25:28 GMT 2002 i686 unknown mars@spock:~> cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) XP 1700+ stepping : 2 cpu MHz : 1477.531 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow bogomips : 2949.12
The Results
All data and graphs can be viewed here.
The csv file resulting from running the test suite was imported into MS Excel<tm> to produce graphs that were then exported to html format.
For the memory comparison, a graph was produced for the each of the two XPath expressions.
The graphs for the time comparison were done in the same way, but there's also a second set of graphs that show a close-up of the lower range of values.
The original spreadsheet is available here.
Analysis
When retrieving the last element, every element in the document has to be compared against the expression. In doing this, two of the four tested engines have linear time complexity in relation to the size of the XML document: Xalan 2.0.1 and TagBox. The other two, Xalan 2.3.1 and Jaxen, seem to behave exponentially, and also start out slower.
All engines bar Jaxen have constant or near constant ordo for retrieving the first element - this would be indicative of an implementation that turns the XPath expression into a filtering iterator that indexes the XML directly. Jaxen in contrast take as long time to retrieve the first element as it does the last, which would suggest that the result set is created in full before any of the nodes are actually retrieved.
Comments
Fluctuations in the measurements can be put down to garbage collection occuring after the document has been created. This is especially noticeable in the second graph comparing Xalan 2.0.1, TagBox and Xalan 2.3.1 performance where the total time spent is less than 1000ms.
While it may seem that Xalan has taken a nose dive performance wise with the most recent of the two versions tested, it is important to remember that this projects focus is not on XPath evaluations, but XSL transforms. In this context performance is achieved by pre-processing stylesheets into templates that will transform an XML document, which is not in the scope of this study.
It is also worth noting that Jaxen was only tested with dom4j, not any of the other data models it supports - DOM, eXML or JDOM.
Resources
test results graphs and data |
original spreadsheet |
test suite java sauce files, scripts and libs |
test web application simple tagbox application that lets you run your own tests |