More about parsing
In the previous chapter we only took a glance at the parse_doc method. Here we take a deeper look at the Scanner object which is passed to the parse_doc method.
The Scanner is similar to Ruby's built-in StringScanner: it remembers the position of a scan pointer (a position inside the string we're parsing). Scanning is the process of advancing the scan pointer through the string a small step at a time. For this there are two core methods:
- match(regex) matches a regex starting at the current position of the scan pointer, advances the scan pointer to the end of the match, and returns the matching string. When the regex doesn't match, it returns nil.
- look(regex) does the same, except that it doesn't advance the scan pointer, so its use is to look ahead.
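Since the Scanner is described as similar to StringScanner, the two methods can be tried out with Ruby's standard library alone. The sketch below assumes that StringScanner#scan behaves like match and StringScanner#check behaves like look. It only illustrates the idea and is not the Scanner's own code, and the sample input is made up.

```ruby
require 'strscan'

# Stand-in for the Scanner: a plain StringScanner over a made-up string.
s = StringScanner.new("<john> Doe")

# check is the look-ahead analogue: it returns the match but leaves the pointer alone.
peeked = s.check(/</)
pos_after_look = s.pos

# scan is the match analogue: it returns the match and advances the pointer past it.
matched = s.scan(/</)
pos_after_match = s.pos

# A regex that doesn't match at the current position gives nil and moves nothing.
missed = s.scan(/>/)

puts peeked.inspect, pos_after_look, matched.inspect, pos_after_match, missed.inspect
```

The pointer only moves on a successful scan, which is why a look followed by a match of the same regex is safe.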
Let's visualize how scanning works. Here's the state of the Scanner at the time parse_doc gets called.
* @author <[email protected]> John Doe
          ^
The scan pointer (denoted as ^) has stopped at the first non-whitespace character after the name of the tag. At that point we could look ahead to see what's coming. Say, we could check if we're at the beginning of an e-mail address block:
if scanner.look(/</)
  # parse-email
end
If so, we want to extract the e-mail address. But first let's match the < char, which we want to exclude from our e-mail address:
scanner.match(/</)
The scan pointer has now moved forward a step:
* @author <[email protected]> John Doe
           ^
Now we can match the e-mail address itself and store it to a variable:
email = scanner.match(/\w+@\w+(\.\w+)+/)
This advances the scan pointer even further along:
* @author <[email protected]> John Doe
                            ^
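It's worth seeing what this e-mail pattern accepts when it is anchored at the scan pointer. The sketch below again uses StringScanner#scan as an assumed analogue of match, with made-up sample addresses. It suggests that an address with a dot before the @ would not match, because \w+ cannot cross the dot.

```ruby
require 'strscan'

email_re = /\w+@\w+(\.\w+)+/

# In both cases the pointer sits right after "<" (hypothetical sample addresses).
simple = StringScanner.new("john@example.com> John Doe")
dotted = StringScanner.new("john.doe@example.com> John Doe")

simple_email = simple.scan(email_re)  # the whole address, stopping before ">"
dotted_email = dotted.scan(email_re)  # nil: \w+ stops at the "." before "@"

p simple_email
p dotted_email
```

If such addresses matter, a looser pattern like /[^>]+/ would be one option, at the cost of accepting anything up to the closing >.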
Next we skip the closing >:
scanner.match(/>/)
Moving our scan pointer again:
* @author <[email protected]> John Doe
                             ^
And finally let's skip the whitespace after it, using the hw method of Scanner, which skips just the horizontal whitespace:
scanner.hw
Our scan pointer ends up at:
* @author <[email protected]> John Doe
                              ^
From here on we just want to match the name of the author, which could be anything, so we just match up to the end of a line:
name = scanner.match(/.*$/)
Moving our scan pointer to the finish:
* @author <[email protected]> John Doe
                                      ^
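The reason /.*$/ stops at the end of the line is that in Ruby "." does not match a newline by default and "$" matches at the end of a line. A small sketch, using StringScanner#scan as an assumed analogue of match and a made-up two-line input:

```ruby
require 'strscan'

# Hypothetical two-line comment body; the pointer starts at the author name.
s = StringScanner.new("John Doe\n * Some other line")

name = s.scan(/.*$/)  # "." never matches "\n", and "$" matches just before it
rest = s.rest         # everything still unscanned, starting with the newline

# When the pointer is already at the end of a line, the pattern still matches,
# but the result is an empty string rather than nil.
empty = StringScanner.new("\n * Some other line").scan(/.*$/)

p name
p rest
p empty
```

So with this pattern a missing author name would show up as an empty string, not as nil.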
Putting it all together we get a parse_doc method for an @author tag which is followed by an optional e-mail address and name of the author:
def parse_doc(scanner)
  if scanner.look(/</)
    scanner.match(/</)
    email = scanner.match(/\w+@\w+(\.\w+)+/)
    scanner.match(/>/)
    scanner.hw
  end
  name = scanner.match(/.*$/)
  return { :tagname => :author, :name => name, :email => email }
end
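To try this method outside of the tool, one could feed it a stand-in scanner. The FakeScanner class below is hypothetical: it is built on StringScanner and implements only match, look and hw as they are described on this page, so it is a sketch for experimenting and not the real Scanner. The input strings are made up and begin where the scan pointer would be when parse_doc gets called.

```ruby
require 'strscan'

# Hypothetical stand-in for the Scanner, built on StringScanner.
class FakeScanner
  def initialize(text)
    @input = StringScanner.new(text)
  end

  # Match at the pointer and advance; nil when there is no match.
  def match(regex)
    @input.scan(regex)
  end

  # Match at the pointer without advancing.
  def look(regex)
    @input.check(regex)
  end

  # Skip spaces and tabs, but not newlines.
  def hw
    @input.scan(/[ \t]*/)
  end
end

# The parse_doc method from above, unchanged.
def parse_doc(scanner)
  if scanner.look(/</)
    scanner.match(/</)
    email = scanner.match(/\w+@\w+(\.\w+)+/)
    scanner.match(/>/)
    scanner.hw
  end
  name = scanner.match(/.*$/)
  return { :tagname => :author, :name => name, :email => email }
end

with_email = parse_doc(FakeScanner.new("<john@example.com> John Doe"))
without_email = parse_doc(FakeScanner.new("John Doe"))

p with_email
p without_email
```

When there is no e-mail block, email is still a defined local variable with the value nil, because Ruby creates the variable as soon as the parser sees the assignment, even if that branch never runs.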