Nokogiri Performance: xpath vs. tree walking/iterating
At Gnip we're doing some heavy XML parsing in Ruby and obviously chose Nokogiri as the underlying engine. I started the week doing xpath searches to tease out the elements/attributes from the documents we were parsing, and ended the week iterating over the root's children in order to achieve significant (~3x) performance improvements. While xpath is convenient, as you don't have to pay attention to document structure (assuming you have your head around xpath syntax to begin with), it's horribly expensive in terms of processing time. It's truly searching the document for what you want; expensive!
There's a big difference between using the "search" (e.g. xpath) interface on top of a parsed DOM and running over the entire tree yourself, testing each node for what you're looking for. The code gets a little uglier with the latter (it's not as elegant/clean), but that's where the performance kicks in. Moving from xpath search-style parsing to tree walking yielded a ~3x performance improvement in parsing for me. I suspect that going all the way to either Nokogiri's Reader or SAX interface would yield an additional 10% improvement over that, though I haven't measured it. However, I'm stopping here for now, as the readability/complexity cost of a full Reader/SAX stack-maintenance model doesn't feel worth it at the moment. I would like to try pauldix's SAX Machine declarative parser (built on Nokogiri) and run some benchmarks, but... another day.
Old:
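require 'nokogiri'

doc = Nokogiri::XML.parse(some_xml_string) { |cfg| cfg.noblanks }
# Each xpath call is a search; "xmlns" is the prefix Nokogiri registers
# for the root's default namespace.
doc.xpath('/xmlns:feed/xmlns:entry').each do |entry|
  at = entry.xpath('xmlns:published').text
  id = entry.xpath('xmlns:id').text
  ...
end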
New:
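require 'nokogiri'

doc = Nokogiri::XML.parse(some_xml_string) { |cfg| cfg.noblanks }
doc.root.children.each do |node|
  # Declared out here so the values assigned in the inner block
  # are still visible at the "..." below.
  at = id = nil
  if node.name == 'entry'
    node.children.each do |sub_e|
      at = sub_e.inner_text if sub_e.name == 'published'
      id = sub_e.inner_text if sub_e.name == 'id'
    end
  end
  ...
end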