XPath Performance Comparison

Martin Klang

A very simple study comparing the performance of three open-source XPath engines.

Based on a performance study by Martin Böhm and Jean-Jacques Dubray, click here for more details.

Purpose

This study compares the time and memory complexity of four different XPath configurations, and how performance varies with an increasing XML document size.

Participating XPath Engines

Name Configuration XML Parser
TagBox TagBox XPath Engine v0.9.1Xerces 2.0.0
Jaxen Jaxen dom4j XPath Enginedom4j
Xalan 2.0.1 Xalan-Java XSLT ProcessorXerces 1.2.3
Xalan 2.3.1 Xalan-Java XSLT ProcessorXerces 2.0.0

The Test

Two XPath expressions were evaluated by the four different configurations and the first node of the result set from a varying sized document was retrieved.

The expressions ran against a simple three-level document that was created in memory. The document had 'n' level one elements (<EDB-ART>), where 'n' varied from 1 to 10000 in 1000 increments. The level one elements in turn had 20 child elements, each with a name unique in the document.

Example:

<?xml version="1.0" encoding="UTF-8"?>
<plm_result>
  <EDB-ART>
    <T_MASTER_DAT.PART_ID0x0>123456789</T_MASTER_DAT.PART_ID0x0>
    <T_MASTER_DAT.PART_ID1x0>123456789</T_MASTER_DAT.PART_ID1x0>
    <T_MASTER_DAT.PART_ID2x0>123456789</T_MASTER_DAT.PART_ID2x0>
    <T_MASTER_DAT.PART_ID3x0>123456789</T_MASTER_DAT.PART_ID3x0>
    <T_MASTER_DAT.PART_ID4x0>123456789</T_MASTER_DAT.PART_ID4x0>
    <T_MASTER_DAT.PART_ID5x0>123456789</T_MASTER_DAT.PART_ID5x0>
    <T_MASTER_DAT.PART_ID6x0>123456789</T_MASTER_DAT.PART_ID6x0>
    <T_MASTER_DAT.PART_ID7x0>123456789</T_MASTER_DAT.PART_ID7x0>
    <T_MASTER_DAT.PART_ID8x0>123456789</T_MASTER_DAT.PART_ID8x0>
    <T_MASTER_DAT.PART_ID9x0>123456789</T_MASTER_DAT.PART_ID9x0>
    <T_MASTER_DAT.PART_ID10x0>123456789</T_MASTER_DAT.PART_ID10x0>
    <T_MASTER_DAT.PART_ID11x0>123456789</T_MASTER_DAT.PART_ID11x0>
    <T_MASTER_DAT.PART_ID12x0>123456789</T_MASTER_DAT.PART_ID12x0>
    <T_MASTER_DAT.PART_ID13x0>123456789</T_MASTER_DAT.PART_ID13x0>
    <T_MASTER_DAT.PART_ID14x0>123456789</T_MASTER_DAT.PART_ID14x0>
    <T_MASTER_DAT.PART_ID15x0>123456789</T_MASTER_DAT.PART_ID15x0>
    <T_MASTER_DAT.PART_ID16x0>123456789</T_MASTER_DAT.PART_ID16x0>
    <T_MASTER_DAT.PART_ID17x0>123456789</T_MASTER_DAT.PART_ID17x0>
    <T_MASTER_DAT.PART_ID18x0>123456789</T_MASTER_DAT.PART_ID18x0>
    <T_MASTER_DAT.PART_ID19x0>123456789</T_MASTER_DAT.PART_ID19x0>
  </EDB-ART>
</plm_result>

The first expression indexed the first level two element on the descendant-or-self axis:

//T_MASTER_DAT.PART_ID0x0

The second expression indexed the last of all of the level two elements, eg for a document with 1000 level one elements:

//T_MASTER_DAT.PART_ID19x999

Time was measured as the difference just before the XPath expression was evaluated and just after a single element had been retrieved from the result set.

Memory was measured as the difference in total memory used by the JVM before the document was created and after the first element had been retrieved from the XPath result set.

The tests were executed on a single occasion so there has been no averaging of results.

All four configurations were tested using the simplest API provided, with no special caching of parsed expressions or results.

Test units consisted of evaluating a single XPath expression 100 times with one XPath engine.

All tests ran 'cold', ie for each test unit a separate Java process was started with no pre-runs to condition the JVM.

The Platform

The test machine was a Linux 2.4.18 PC with a single AMD Athlon 1700+ processor and 512Mb RAM, running jdk1.3.1:

mars@spock:~> java -version
java version "1.3.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1-b24)
Java HotSpot(TM) Client VM (build 1.3.1-b24, mixed mode)

mars@spock:~> uname -a
Linux spock 2.4.18 #2 Sun Mar 17 17:25:28 GMT 2002 i686 unknown

mars@spock:~> cat /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 6
model name	: AMD Athlon(tm) XP 1700+
stepping	: 2
cpu MHz		: 1477.531
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips	: 2949.12

The Results

All data and graphs can be viewed here.

The csv file resulting from running the test suite was imported into MS Excel<tm> to produce graphs that were then exported to html format.

For the memory comparison, a graph was produced for the each of the two XPath expressions.

The graphs for the time comparison were done in the same way, but there's also a second set of graphs that show a close-up of the lower range of values.

The original spreadsheet is available here.

Analysis

When retrieving the last element, every element in the document has to be compared against the expression. In doing this, two of the four tested engines have linear time complexity in relation to the size of the XML document: Xalan 2.0.1 and TagBox. The other two, Xalan 2.3.1 and Jaxen, seem to behave exponentially, and also start out slower.

All engines bar Jaxen have constant or near constant ordo for retrieving the first element - this would be indicative of an implementation that turns the XPath expression into a filtering iterator that indexes the XML directly. Jaxen in contrast take as long time to retrieve the first element as it does the last, which would suggest that the result set is created in full before any of the nodes are actually retrieved.

Comments

Fluctuations in the measurements can be put down to garbage collection occuring after the document has been created. This is especially noticeable in the second graph comparing Xalan 2.0.1, TagBox and Xalan 2.3.1 performance where the total time spent is less than 1000ms.

While it may seem that Xalan has taken a nose dive performance wise with the most recent of the two versions tested, it is important to remember that this projects focus is not on XPath evaluations, but XSL transforms. In this context performance is achieved by pre-processing stylesheets into templates that will transform an XML document, which is not in the scope of this study.

It is also worth noting that Jaxen was only tested with dom4j, not any of the other data models it supports - DOM, eXML or JDOM.

Resources

test results graphs and data
original spreadsheet
test suite java sauce files, scripts and libs
test web application simple tagbox application that lets you run your own tests