Parsing XML with Ruby

Just for kicks and giggles, I decided to parse XML with each of the main Ruby libraries (REXML, Hpricot, libxml-ruby) so I could compare them on both API (getting at elements and attributes) and speed. I tried two different XML formats. The first, Delicious, is attribute-based; the second, Twitter, is element-based. If you look at the XML files linked below, that distinction should make more sense.

Note: This is not a scientific benchmark. The goal is to get a feel for each library and how you traverse XML nodes with it.

The XML

Here are the files I used for reference. You’ll have to view source after clicking one of these links to actually see the XML.

posts.xml – Uses an XML element for each object (post) and XML attributes for the object’s attributes
timeline.xml – Uses an XML element for each object (status) and child XML elements for the object’s attributes
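
To make the difference between the two formats concrete, here is a minimal sketch using REXML. The two XML snippets are hypothetical, trimmed-down stand-ins that I made up for illustration; the real posts.xml and timeline.xml have more fields and many more records.

```ruby
require 'rexml/document'

# Hypothetical sample in the attribute-based (Delicious-like) style.
ATTRIBUTE_STYLE = <<~XML
  <posts user="example">
    <post href="http://example.com/" description="Example" tag="ruby xml"/>
  </posts>
XML

# Hypothetical sample in the element-based (Twitter-like) style.
ELEMENT_STYLE = <<~XML
  <statuses>
    <status>
      <id>1</id>
      <source>web</source>
      <user>
        <screen_name>example</screen_name>
      </user>
    </status>
  </statuses>
XML

# Attribute style: everything about a post lives on the <post> tag itself.
post = REXML::Document.new(ATTRIBUTE_STYLE).elements['posts/post']
post_data = { :href => post.attributes['href'], :tag => post.attributes['tag'] }

# Element style: each field is a child element whose text holds the value.
status = REXML::Document.new(ELEMENT_STYLE).elements['statuses/status']
status_data = {
  :id          => status.elements['id'].text,
  :screen_name => status.elements['user/screen_name'].text
}

p post_data
p status_data
```

With the attribute style you get all of the data from one node; with the element style you need one lookup per field, which is why the Twitter scripts below loop over lists of field names.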

REXML

Pros: In the standard library
Cons: Slow, I don’t like the name

%w[benchmark pp rexml/document].each { |x| require x }

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
  doc, posts = REXML::Document.new(xml), []
  doc.elements.each('posts/post') do |p|
    posts << p.attributes
  end
  # pp posts
}


################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
  doc, statuses = REXML::Document.new(xml), []
  doc.elements.each('statuses/status') do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.elements[a].text
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.elements['user'].elements[a].text
    end
    statuses << h
  end
  # pp statuses
}

Hpricot

Pros: Cool name, created by _why, faster than REXML, also does HTML, creative API
Cons: Not as fast as libxml-ruby, reads more like an HTML parser (i.e., it uses innerHTML instead of text or content)

%w[benchmark pp rubygems].each { |x| require x }
gem 'hpricot', '>= 0.6'
require 'hpricot'

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
  doc, posts = Hpricot::XML(xml), []
  (doc/:post).each do |p|
    posts << p.attributes
  end
  # pp posts
}


################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
  doc, statuses = Hpricot::XML(xml), []
  (doc/:status).each do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.at(a).innerHTML
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.at('user').at(a).innerHTML
    end
    statuses << h
  end
  # pp statuses
}

libxml-ruby

Pros: Blisteringly fast
Cons: Hpricot has cooler name, REXML and Hpricot both feel easier to use out of the box

%w[benchmark pp rubygems].each { |x| require x }
gem 'libxml-ruby', '>= 0.8.3'
require 'xml'

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
  # Separate statements so parser is assigned before parser.string= is called
  parser = XML::Parser.new
  parser.string = xml
  doc, posts = parser.parse, []
  doc.find('//posts/post').each do |p|
    posts << p.attributes.inject({}) { |h, a| h[a.name] = a.value; h }
  end
  # pp posts
}


################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
  # Separate statements so parser is assigned before parser.string= is called
  parser = XML::Parser.new
  parser.string = xml
  doc, statuses = parser.parse, []
  doc.find('//statuses/status').each do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.find(a).first.content
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.find('user').first.find(a).first.content
    end
    statuses << h
  end
  # pp statuses
}

Conclusion

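I’ll probably start using libxml-ruby, but Hpricot is more fun (and I’ve used it a ton). Oh, if you are curious, this was the output from the scripts above on my machine. The columns are user CPU time, system CPU time, their total, and (in parentheses) elapsed real time, all in seconds.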

=rexml
delicious     0.020000   0.000000   0.020000 (  0.021139)
twitter       0.940000   0.020000   0.960000 (  0.988666)

=hpricot
delicious     0.010000   0.000000   0.010000 (  0.005548)
twitter       0.250000   0.010000   0.260000 (  0.258320)

=libxml-ruby
delicious     0.000000   0.000000   0.000000 (  0.007829)
twitter       0.030000   0.010000   0.040000 (  0.034040)

The Twitter one is most likely slower because of the loops and hashes. I doubt it has much to do with the actual parsing, though it is a larger file, so the parsing itself would be a bit slower too.
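
One way to check that guess would be to time the parse and the traversal separately. The following is only a sketch using REXML from the standard library; the generated XML is a hypothetical stand-in for timeline.xml, so the numbers it prints will not match the ones above.

```ruby
require 'benchmark'
require 'rexml/document'

# Hypothetical stand-in for timeline.xml: 200 generated <status> records.
xml = '<statuses>' + (1..200).map { |i|
  "<status><id>#{i}</id><source>web</source>" \
  "<user><id>#{i}</id><screen_name>user#{i}</screen_name></user></status>"
}.join + '</statuses>'

doc = nil
statuses = []

# Step 1: time only the construction of the document tree.
parse_time = Benchmark.measure { doc = REXML::Document.new(xml) }

# Step 2: time only the element lookups and hash building.
walk_time = Benchmark.measure {
  doc.elements.each('statuses/status') do |s|
    statuses << {
      :id     => s.elements['id'].text,
      :source => s.elements['source'].text,
      :user   => { :screen_name => s.elements['user/screen_name'].text }
    }
  end
}

puts "parse    #{parse_time}"
puts "traverse #{walk_time}"
```

If the traverse line dominates, the loops and lookups are the bottleneck; if the parse line dominates, the file size matters more than I assumed.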

9 Comments

jney
Aug 12, 2008

Hi, have you already seen this post: http://thebogles.com/blog/an-hpricot-style-interface-to-libxml/? It uses libxml in an Hpricot way. It looks nice.
John Nunemaker
Aug 12, 2008

@jney – Sweet! No I hadn’t viewed that. Thanks for the link.
Kunal Parikh
Aug 12, 2008

Have you looked @ SimpleXML?
Seth
Aug 12, 2008

Since many web services also provide JSON feeds, have you done any benchmarking of libxml vs. json (and json-pure)?
John Nunemaker
Aug 13, 2008

@Kunal – xml-simple uses rexml under the hood and I’m technically using it with HTTParty as I’m using Active Support which uses xml-simple. So yep, I’ve looked at it but it’s going to have the same speed issues as REXML.
Rajmohan
Aug 14, 2008

Since you work so much with XML in ruby, was wondering if you have come across any ruby library that does SAX with Pull Parsing? Just like StAX in Java?
Brandon Mitchell
Aug 14, 2008

HTTParty might benefit from the work I did replacing xml-simple in ActiveSupport in favor of libxml-ruby here.

I found significant performance improvements for relatively little work, with these modifications.
Soleone
Aug 15, 2008

I definitely think libxml-ruby with a nicer API (kinda like hpricot, but more xml oriented) is the way to go! Would be cool if we could standardize something like this.

StAX would also be cool I guess, at least to have something to show the suits-people :)
jan
Aug 22, 2008

There is an innerText method in Hpricot that you can use instead of innerHTML. I recently found out that innerText converts entities (e.g. &amp; to &) whereas innerHTML does not.