Skip to content
coonsta edited this page Sep 13, 2010 · 3 revisions

gg-grammar defines a grammar from a list of rules. Each rule is list surrounded by parentheses. The first symbol of each rule is its name. (There are four rules in the above example, start, term, fact, and num written (gg-grammar (start ...) (term ...) (fact ...) (num ...)).) The first rule in the grammar is the start rule. When you call gg-parse the first rule is interpreted against the contents of the current buffer at the current point. In the above example the rule is called start, but you can call it anything you like.

The rest of the symbols in a rule define one or more alternatives. If there is more than one alternative, the alternatives are separated by |. In the above example start has one alternative, term and fact have three, and num has ten—one for each digit. A PEG parser, unlike typical BNF, tries the alternatives in order until one of them succeeds. This means the order of alternatives is significant and you must pay close attention that an earlier alternative is not a prefix of a later alternative, or the later alternative will never be activated. An alternative has two parts, a parsing expression and an action, separated by an arrow.

The parsing expression is a sequence of things that must be matched for the alternative to succeed. These can be:

  • A character literal. ?a matches the letter a. The result of parsing a character literal is the character itself.
  • A regular expression. "[aeiou]+" matches one or more vowels. The result of parsing a regular expression is the string that the expression matched, but if a regular expression is the last expression evaluated, the usual functions like match-beginning, match-end, and match-string let you extract groups. See also Syntax of Regular Expressions in the Emacs manual.
  • The end of the buffer, eof. Use this at the end of your start rule to ensure your parser consumed all of the input in the buffer. The result is t.
  • The name of a rule. The result of a rule is whatever its actions return.
  • A filter, ... if s-exp. s-exp should be a function that takes one argument. The function is passed the result of the expression to the left of the if; if the function returns true, then the parse succeeds and the result is whatever the result of the expression on the left is.
  • A binding, x := .... This remembers the result of the expression on the right. Filters and actions can refer to the variable.
  • Lookahead & .... This runs ahead in the input. If the expression on the right matches, then the parse succeeds. No input is consumed past the &.
  • Negative lookahead ~ .... Like lookahead this runs ahead in the input, but the parse succeeds only if the expression on the right isn’t matched. Negative lookahead is typically used in PEG parses for implementing “longest match” rules by checking that whatever comes next couldn’t be parsed too.

The action is an s-expression written after ->. Actions compute the result of parsing the expression. In the example above the actions do arithmetic operations; typically actions build an abstract-syntax tree. When your action runs two variables, gg-start and gg-end, are set to the positions in the buffer at the start and end of the expression being parsed. You can use these annotate your abstract syntax tree with position information. Actions are optional. If you don’t write an action, the result of the alternative is the result of the rightmost expression.

Because a PEG parser may memoize results and reuse them after back-tracking, actions should be pure functions. For example, if you want to use Guruguru to colorize a buffer, you should build a list of attribute runs in your actions and then walk over them when the parser completes to colorize the buffer, instead of colorizing the buffer directly in the actions.

See also Debugging Grammars.

Clone this wiki locally