Parsing is the art of making sense of a bunch of strings and converting them into something we can understand. You could use regular expressions, but they are not always suitable for the job.
For example, it is common knowledge that parsing HTML with regular expressions is probably not a good idea.
In Ruby we have Nokogiri, which can do this work for us, but you can learn a lot by building your own parser. Let’s get started!
Parsing with Ruby
The core of our parser is the StringScanner class.
This class holds a copy of a string and a position pointer. The pointer will allow us to traverse the string in search of certain tokens.
The methods we will be using are:
- .peek
- .scan_until
- .getch
Another useful method is .scan (without the until).
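To make the difference between .scan and .scan_until concrete, here is a small sketch. The input string and variable names are made up for illustration; only the StringScanner calls come from the standard library.

```ruby
require 'strscan'

scanner = StringScanner.new("<b>bold</b>")

# scan only matches if the pattern starts exactly at the current position.
opening = scanner.scan(/</)            # matches the "<" at position 0
no_match = scanner.scan(/>/)           # nil: the next character is "b", not ">"

# scan_until searches forward and returns everything up to and
# including the match.
rest_of_tag = scanner.scan_until(/>/)  # "b>"
```

A failed scan returns nil and leaves the position pointer where it was, which is why scan_until can still pick up from the same spot.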
Note:
If StringScanner is not available to you, try adding require 'strscan'.
I wrote two tests as documentation so we can understand how this class is supposed to work:
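require 'strscan'
describe StringScanner do
  let(:buff) { StringScanner.new "testing" }
  # peek looks ahead without moving the position pointer
  it "can peek one step ahead" do
    expect(buff.peek(1)).to eq "t"
  end
  # getch returns the next character and moves the pointer forward
  it "can read one char and return it" do
    expect(buff.getch).to eq "t"
    expect(buff.getch).to eq "e"
  end
end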
One important thing to notice about this class is that some methods advance the position pointer (getch, scan), while others don’t (peek). At any point you can inspect your scanner (using .inspect or p) to see where it is.
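The pointer behaviour can be checked directly with the scanner's pos method. This is only an illustrative sketch; the variable names are arbitrary.

```ruby
require 'strscan'

scanner = StringScanner.new("testing")

before_peek = scanner.pos   # 0
scanner.peek(3)             # looks at "tes" without consuming it
after_peek = scanner.pos    # still 0

scanner.getch               # consumes "t"
after_getch = scanner.pos   # 1

scanner.scan(/est/)         # consumes "est"
after_scan = scanner.pos    # 4

p scanner                   # prints the scanner's state, including its position
```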
The parser class
The parser class is where most of the work happens. We initialize it with the snippet of text we want to parse; it creates a StringScanner for that text and calls the parse method:
def initialize(str)
  @buffer = StringScanner.new(str)
  @tags = []
  parse
end
In the test we define it like this:
let(:parser) { Parser.new "<body>testing</body> <title>parsing with ruby</title>" }
We will dive into how this class does its job in a bit, but first let’s take a look at the last piece of our program.
The Tag Class
This class is very simple. It mainly serves as a container for the parsing results.
class Tag
  attr_reader :name
  attr_accessor :content
  def initialize(name)
    @name = name
  end
end
Let’s Parse!
To parse something we will need to look at our input text to find patterns. For example, we know HTML code has the following form:
<tag>contents</tag>
There are clearly two different components we can identify here: the tag names and the text inside the tags. If we were to define a formal grammar using the BNF notation it would look something like this:
tag = <opening_tag> <contents> <closing_tag>
opening_tag = "<" <tag_name> ">"
closing_tag = "</" <tag_name> ">"
We are going to use StringScanner’s peek to see if the next symbol in our input buffer is an opening tag. If that’s the case then we will call the find_tag and find_content methods on our Parser class:
def parse_element
  if @buffer.peek(1) == '<'
    @tags << find_tag
    last_tag.content = find_content
  end
end
The find_tag method will:
- ‘Consume’ the opening tag character
- Scan until the closing symbol (“>”) is found
- Create and return a new Tag object with the tag name
Here is the code. Notice how we have to chop the last character: scan_until includes the ‘>’ in its result, and we don’t want that.
def find_tag
  @buffer.getch                    # consume the "<"
  tag = @buffer.scan_until(/>/)    # e.g. "body>"
  Tag.new(tag.chop)                # drop the trailing ">"
end
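The three steps of find_tag can be tried on a bare StringScanner, outside the Parser class. The sketch below uses a local variable named buffer in place of @buffer, purely for illustration.

```ruby
require 'strscan'

buffer = StringScanner.new("<body>testing</body>")

buffer.getch                        # consumes the "<"
raw_name = buffer.scan_until(/>/)   # "body>", the ">" is included
tag_name = raw_name.chop            # "body"

# rest returns whatever has not been consumed yet.
remaining = buffer.rest             # "testing</body>"
```

After these calls the pointer sits right at the start of the tag's content, which is what the next step relies on.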
The next step is finding the content inside the tag. This shouldn’t be too hard, since the previous scan_until call left the position pointer right after the opening tag. We are going to use scan_until again to find the closing tag and return the tag contents.
def find_content
  tag = last_tag.name
  content = @buffer.scan_until(/<\/#{tag}>/)   # e.g. "testing</body>"
  content.sub("</#{tag}>", "")                 # strip the closing tag
end
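The same idea can be tried in isolation. In this sketch the buffer starts right after an opening tag, and the tag name is hard-coded instead of coming from last_tag; both are assumptions made for the example.

```ruby
require 'strscan'

buffer = StringScanner.new("testing</body> <title>")
tag = "body"

raw = buffer.scan_until(/<\/#{tag}>/)   # "testing</body>"
content = raw.sub("</#{tag}>", "")      # "testing"

# If the closing tag is missing, scan_until returns nil instead of a string.
unclosed = StringScanner.new("testing").scan_until(/<\/#{tag}>/)
```

Because scan_until returns nil when nothing matches, calling sub on the result would raise a NoMethodError for input with an unclosed tag. A more defensive parser would probably want to check for that case.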
Now all we need to do is call parse_element in a loop until we reach the end of our input buffer.
def parse
  until @buffer.eos?
    skip_spaces
    parse_element
  end
end
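Putting the pieces together, the whole program might look roughly like the sketch below. The article does not show skip_spaces, last_tag, or a public reader for the tags, so those three are guesses at plausible definitions and may differ from the code in the repository. The sketch also assumes the input consists only of well-formed, non-nested tags separated by whitespace; any other text would never be consumed and the loop would not terminate.

```ruby
require 'strscan'

class Tag
  attr_reader :name
  attr_accessor :content

  def initialize(name)
    @name = name
  end
end

class Parser
  # Assumption: expose the parsed tags so callers can read the results.
  attr_reader :tags

  def initialize(str)
    @buffer = StringScanner.new(str)
    @tags = []
    parse
  end

  private

  def parse
    until @buffer.eos?
      skip_spaces
      parse_element
    end
  end

  def parse_element
    if @buffer.peek(1) == '<'
      @tags << find_tag
      last_tag.content = find_content
    end
  end

  def find_tag
    @buffer.getch
    tag = @buffer.scan_until(/>/)
    Tag.new(tag.chop)
  end

  def find_content
    tag = last_tag.name
    content = @buffer.scan_until(/<\/#{tag}>/)
    content.sub("</#{tag}>", "")
  end

  # Assumed helper: consume any whitespace between tags.
  def skip_spaces
    @buffer.skip(/\s+/)
  end

  # Assumed helper: the most recently parsed tag.
  def last_tag
    @tags.last
  end
end

parser = Parser.new("<body>testing</body> <title>parsing with ruby</title>")
parser.tags.each { |tag| puts "#{tag.name}: #{tag.content}" }
# prints:
# body: testing
# title: parsing with ruby
```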
You can find the complete code here: https://github.com/matugm/simple-parser. You can also look at the ‘nested_tags’ branch for the extended version that can deal with tags inside another tag.
Conclusion
Writing a parser is an interesting topic and it can also get pretty complicated at times.
If you don’t want to make your own parser from scratch you can use one of the so-called ‘parser generators’. In Ruby we have Treetop and Parslet.