Are you trying to parse HTML with Ruby?
This task can be a bit difficult if you don’t have the right tools.
But today you’re in luck!
Because Ruby has this wonderful library called Nokogiri, which makes HTML parsing a walk in the park.
Let’s see some examples.
First, install the nokogiri gem with:
gem install nokogiri
If you have issues installing the gem, try this:
gem install nokogiri -- --use-system-libraries
How to Extract The Title
Then create the following script, which contains a basic HTML snippet that will be parsed by nokogiri.
Run this code to get the page title:
require 'nokogiri'

html = "<title>test</title><body>actual content here...</body>"
parsed_data = Nokogiri::HTML.parse(html)

puts parsed_data.title
# => "test"
If you want to parse data directly from a URL, instead of an HTML string…
You can do this:
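require 'nokogiri'
require 'open-uri'

# URI.open comes from open-uri; a bare open(url) no longer works on Ruby 3.0+
Nokogiri::HTML.parse(URI.open('http://example.com')).title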
This will download the HTML & get you the title.
Now:
Getting the title is nice, but you probably want to see more advanced examples.
Right?
Let’s take a look at how to extract links from a website.
Extracting Anchor Links
If you want all the links from a page, you’ll first need the HTML.
You can use the same open-uri technique to download the HTML for any public website.
Then parse it with Nokogiri to get a document object.
Like this:
document = Nokogiri::HTML.parse(URI.open('http://example.com'))
document.class # Nokogiri::HTML::Document (newer Nokogiri releases print Nokogiri::HTML4::Document, which is the same class)
You can query this object for information in one of two ways:
- Using XPath queries
- Using CSS selectors
Let’s see how to do this using XPath first.
Here’s the code:
tags = document.xpath("//a")
What does that do?
The XPath expression //a searches the whole document and returns every element that matches.
In this case that means the “a” tags, which are the tags that contain links in HTML.
Now:
What you have is a Nokogiri::XML::NodeSet, an array-like collection of Nokogiri::XML::Element objects representing these tags.
If you want to get the link URL & text you can do this:
tags.each do |tag|
  puts "#{tag[:href]}\t#{tag.text}"
end
This will print all the links, one per line, on your screen.
If instead of links you want to scrape some other information, like a list of images available on the page, you can follow the same process.
The only thing you need to change is the type of tag you want.
For example:
tags = document.xpath("//img")
images_urls = tags.map { |t| t[:src] }
Here img is the HTML tag for images, and src is the attribute where the image URL is stored.
To find the correct CSS selector & attributes, use your browser’s developer tools.
Using CSS Selectors With Nokogiri
You can use CSS selectors by replacing the xpath method with the css method.
Here’s an example:
headers = document.css("h1") paragraphs = document.css("p")
Note: The difference between at_css & css is that the first one only returns the first matched element, but the latter returns ALL matched elements.
Using CSS gets you the same results; the whole point is telling Nokogiri which HTML elements you want to work with.
Most developers are more familiar with CSS than XPath, so CSS selectors are usually the more convenient choice.
Summary
You can find the Nokogiri documentation here:
http://www.rubydoc.info/github/sparklemotion/nokogiri