Skip to content

Latest commit

 

History

History
163 lines (125 loc) · 7.79 KB

File metadata and controls

163 lines (125 loc) · 7.79 KB

Step 3.4: Regular expressions and custom parsers

Estimated time: 1 day

Regular expressions

To operate with regular expressions there is the regex crate in Rust ecosystem, which is kinda a default choice to go with in most cases.

A Rust library for parsing, compiling, and executing regular expressions. Its syntax is similar to Perl-style regular expressions, but lacks a few features like look around and backreferences. In exchange, all searches execute in linear time with respect to the size of the regular expression and search text. Much of the syntax and implementation is inspired by RE2.

If you need additional features (like look around and backreferences), consider using:

Compile only once

Important to know, that in Rust regular expression needs to be compiled before we can use it. The compilation is not cheap. So, the following code introduces a performance problem:

fn is_email(email: &str) -> bool {
    let re = Regex::new(".+@.+").unwrap();  // compiles every time the function is called
    re.is_match(email)
}

To omit unnecessary performance penalty we should compile regular expression once and reuse its compilation result. This is easily achieved by using the once_cell crate both in global and/or local scopes:

static REGEX_EMAIL: Regex = once_cell::sync::Lazy::new(|| {
    Regex::new(".+@.+").unwrap()
}); // compiles once on the first use

fn is_email(email: &str) -> bool {
    REGEX_EMAIL.is_match(email)
}

This may feel different with how regular expressions are used in other programming languages, because some of them implicitly cache compilation results and/or do not expose compilation API at all (like PHP). But if your background is a language like Go or Java, this concept should be familiar to you.

Custom parsers

If regular expressions are not powerful enough for your parsing problem, then you are ended up with writing your own parser. Rust ecosystem has numerous crates to help with that:

For better understanding parsing problem and approaches, along with some examples, read through the following articles:

Task

Given the following Rust fmt syntax grammar:

format_string := text [ maybe_format text ] *
maybe_format := '{' '{' | '}' '}' | format
format := '{' [ argument ] [ ':' format_spec ] [ ws ] * '}'
argument := integer | identifier

format_spec := [[fill]align][sign]['#']['0'][width]['.' precision]type
fill := character
align := '<' | '^' | '>'
sign := '+' | '-'
width := count
precision := count | '*'
type := '' | '?' | 'x?' | 'X?' | identifier
count := parameter | integer
parameter := argument '$'

In the above grammar,

  • text must not contain any '{' or '}' characters,
  • ws is any character for which char::is_whitespace returns true, has no semantic meaning and is completely optional,
  • integer is a decimal integer that may contain leading zeroes and must fit into an usize and
  • identifier is an IDENTIFIER_OR_KEYWORD (not an IDENTIFIER) as defined by the Rust language reference.

Implement a parser to parse sign, width and precision from a given input (assumed to be a format_spec).

Provide implementations in two flavours: regex-based and via building a custom parser.

Prove your implementation correctness with tests.

Questions

After completing everything above, you should be able to answer (and understand why) the following questions:

  • How does regex crate achieve linear time complexity? In what price?
  • How to avoid regular expression recompilation in Rust? Why is it important?
  • Which are the common kinds of libraries for writing custom parses in Rust? Which benefits does each one have?
  • What advantages does libraries give for writing a custom parser? Are they mandatory? When does it make sense to avoid using a library for implementing a parser?