How to Parse HTML in Ruby

Are you trying to parse HTML with Ruby?

This task can be a bit difficult if you don’t have the right tools.

But today you’re in luck!

Because Ruby has this wonderful library called Nokogiri, which makes HTML parsing a walk in the park.

Let’s see some examples.

First, install the nokogiri gem with:

gem install nokogiri

If the install fails, try building against your system’s libxml2 & libxslt libraries instead of the bundled copies:

gem install nokogiri -- --use-system-libraries

How to Extract The Title

Create the following script, which contains a basic HTML snippet for Nokogiri to parse.

Run this code to get the page title:

require 'nokogiri'

html        = "<title>test</title><body>actual content here...</body>"
parsed_data = Nokogiri::HTML.parse(html)

puts parsed_data.title
# test

If you want to parse data directly from a URL, instead of an HTML string…

You can do this:

require 'open-uri'

# Use URI.open: a plain open(url) no longer fetches URLs since Ruby 3.0
Nokogiri::HTML.parse(URI.open('http://example.com')).title

This will download the HTML & get you the title.

Now:

Getting the title is nice, but you probably want to see more advanced examples.

Right?

Let’s take a look at how to extract links from a website.

Extracting Anchor Links

If you want all the links from a page, you’ll first need the HTML.

You can use the same open-uri technique to download the HTML for any public website.

Then parse it with Nokogiri to get a document object.

Like this:

document = Nokogiri::HTML.parse(URI.open('http://example.com'))

document.class
# Nokogiri::HTML::Document (shown as Nokogiri::HTML4::Document on Nokogiri 1.12+)

You can query this object for information in one of two ways:

Using XPath queries
Using CSS selectors

Let’s see how to do this using XPath first.

Here’s the code:

tags = document.xpath("//a")

What does that do?

This searches every HTML tag in the page & gives you the ones that match your query.

The // prefix means “anywhere in the document”, so //a matches all the “a” tags, which are the tags that contain links in HTML.

Now:

What you have is a Nokogiri::XML::NodeSet, an array-like collection of Nokogiri::XML::Element objects representing these tags.

If you want to get the link URL & text you can do this:

tags.each do |tag|
  puts "#{tag[:href]}\t#{tag.text}"
end

This will print all the links, one per line, on your screen.

If instead of links you want to scrape some other information, like a list of images available on the page, you can follow the same process.

The only thing you need to change is the type of tag you want.

For example:

tags        = document.xpath("//img")
images_urls = tags.map { |t| t[:src] }

Where img is the HTML tag for images, and src is the attribute where the image URL is stored.

To find the right tags, selectors & attributes, use your browser’s developer tools.

Using CSS Selectors With Nokogiri

You can use CSS selectors by replacing the xpath method with the css method.

Here’s an example:

headers    = document.css("h1")
paragraphs = document.css("p")

Note: Nokogiri also has an at_css method. The difference between at_css & css is that at_css only returns the first matched element (or nil if nothing matches), while css returns ALL matched elements.

Using CSS gets you the same results. The whole point is to tell Nokogiri which HTML elements you want to work with.

Most developers are more familiar with CSS than XPath, so CSS selectors are usually the more comfortable choice.

Summary

You can find the Nokogiri documentation here:

http://www.rubydoc.info/github/sparklemotion/nokogiri
