Skip to content
/ SPaG Public

A compiler to translate regular expressions (regular grammars) and LL1 BNF languages (subset of context free grammars) to generated scanners and/or parsers.

License

Notifications You must be signed in to change notification settings

rrozansk/SPaG

Repository files navigation

SPaG (Scanner, Parser, and Generator)

MIT COV VER

Introduction

SPaG is an extensible purely Python framework, with no external dependencies, capable of compiling input specifications of scanners (regular grammars) and/or parsers (LL(1) BNF context free grammars) into DFA's and LL(1) parse table's. These results can then be converted into valid scanner/parser programs following the options configured within the framework during generation. The processes used to transform the input specifications into DFA's and LL(1) parse tables are very formal and well defined. Thorough table driven testing is also utilized to ensure correctness of implementation. The program generator(s) comprise the extensible portion of the framework allowing new targeted languages to be easily pluged-in. Included in the framework is a script which lightly wraps the core library functionality to accept user input file(s) which help drive the compilation/generation process from the command line. The overview below describes in detail what each component sets out to do, how it accomplished those intended goals, and the accepted input.

Installation

Installation is supported through multiple methods, all of which are listed below. Any required dependencies along with the steps needed for a proper install are listed. Installation installs the module containing scanner (RegularGrammar), parser (ContextFreeGrammar), and generator (Generator) objects as well as a command line script (spag_cli) for easily generating scanners and/or parsers from input specification(s) and optional configuration file. It also installs any supported core language generators which may grow over time.

Source

Install directly from the source code. This may include unreleased bug fixes, feature extensions, or newly supported language child generators depending on which git commit is to be built. For this reason it is a good idea to test the commit you wish to build from, ensuring it is a proper build candidate. To help with this process Makefile recipes are included to automate testing, linting, virtual environment installation/cleanup, etc. However, note that while the framework is pure, dependent free, Python the testing is not and requires prerequisites. All required python packages are listed in the requirments. Since these packages are installed in a virtual environment with the help of Make the only requirements the user needs to worry about are those which must be installed on the machine itself and include:

  • git (>= v2.1.3)
  • make (>= v4.1)
  • Python (>= v3.5)
  • Pip (>= v9.0.3)
  • setuptools (>= v40.4.3)
  • virtualenv (>= v16.0.0)
# Obtain the source code.
$ git clone https://github.com/rrozansk/SPaG.git

# Run sanity.
# 1. Constructs a virtual environment with SPaG installed for testing.
# 2. Enter the newly built virtual environment.
# 3. Lint the code to ensure standards are followed.
# 4. Test the code and generate a report to ensure a working package.
# 5. Leave the virtual environment.
# 6. Clean up.
$ make sanity

# Open the generated report in a web-browser and inspect the results.
$ chrome test_report*

# If the results look good then install SPaG from the source.
$ make install

Pip

Install a specific prebuilt pacakge distribution from the PyPI repository. Since the distributions are thoroughly tested and linted before publishing there is no need for that step in this method of installation, unlike the source method. The only requirements needed on the host machine for this method to work include:

  • Python (>= v3.5)
  • Pip (>= v9.0.3)
# Only step required!
$ pip install SPaG

Overview

The below sections provides a quick overview of each individual core component. It briefly describes what the component consists of and how it accomplished it's tasks and goals.

Scanner

An object exposing only initialazation and read only properties. The scanner attempts to transform a collection of named patterns (regular expression as token/type pairs) into a unique minimal DFA accepting any input specified while also containing an errors sink state for all invalid input (if required). The transformation begins by first checking the expression for validity while internalizing it. Next, the use of an augmented version of Dijkstra's shunting yard algorithm transforms the expression into prefix notation. From here Thompson's construction is utilized to produce an NFA with epsilon productions. The NFA is then directly converted into a minimal DFA with respect to reachable states using e-closure conversions which are cached. Finally, the minimal DFA is made total, if not already, so it can be further minimized with respect to nondistinguishable states using Hopcroft's algorithm. This results in the smallest possible total DFA which recognizes the given input. The input itself (regular expressions) must be specified following these guidelines:

  • supported core operators (and extensions) include:

    • '*' (kleene star -> repitition >= 0)
    • '+' (kleene plus -> repitition >= 1)
    • '.' (concatenation -> combine)
    • '|' (union -> choice -> either or)
    • '?' (question -> choice -> 1 or none)
    • [ab] (character class -> choice -> any specified character)
    • [a-c] (character range -> choice -> any char between the two)
    • [^ab] (character negation -> choice -> all but the specified)
    • () (grouping -> disambiguation -> any expression)
    • {n} (interval -> repitition -> exactly n)
    • {n,0} (interval -> repitition -> minimum n)
    • {n,m} (interval -> repitition -> between n (>= 0) and m)
  • other things to keep in mind (potential gotcha's):

    • full unicode input is supported
    • character ranges are determined by python's ord() function.
    • character negation is only over the printable ascii character set.
    • character classes/ranges may be combined, possibly multiple times, in the same set of brackets (e.g. [abc-pqrs-z]).
    • character ranges can be specified as forward or backwards, the results is the same (e.g. [a-c] or [c-a]).
    • if '^' is alone in the brackets (e.g. [^]) it is translated as any possible printable ascii character (i.e. a 'wildcard').
    • concatenation can be either implicit or explicit in the given input expression(s).

Parser

The parser attempts to transform a collection of BNF production rules into a parse table. The first step in constructing the resulting table is determining the terminal and nonterminal sets, which is very trivial. From here, the sets are used to construct the first and follow sets which identify what characters can be expected when parsing corresponding production rules. Subsequently, the first and follow sets are used to construct the predict sets which are in turn used in the table construction. Finally, the table is verified by checking for conflicts. While any BNF is accepted only LL(1) grammars will produce a valid parse table containing no conflicts. Furthermore, only left factored LL(1) grammars will produce a parse table with a single member per entry. To properly specify a LL(1) grammar the following requirements must be met:

  • No left recursion (direct or indirect)
  • Must be left factored
  • No ambiguity

If any of the above requirements are violated, or the grammar specified is not LL(1), there will be a conflict in the table. This means with the given set of input productions and with the aid of a single look ahead token it cannot be known what rule to choose in order to successfully produce a parse without backtracking.

Generator

The base generator is an object which all generators must inherit from. It is responsible for allowing easy configuration of options across all generators as well as providing basic protections against bad option combinations, program input, and generator output. It also allows easy reuse of language generators to compile many specifications into the same output language while allowing easy configuration option changes between generation.

Generators

The generators are wrappers on top of the scanner/parser objects and are responsible for translating/compiling the intermediate representations from the output of the scanner/parser into whatever context desired with the given options configured at the time of generation. Most of the time this will be a useable program in some choosen programming language, which should obey the scanner's function type as an input character stream to a token stream while the parser should take that same token stream to an abstract syntax tree (AST). However, this is not always the case since it is possible to write other generators, say for example cowsay. This illustrates the fact that the generators comprise the extensible portion of the package. This plug-in segment of SPaG will grow over time to subsume more generators into the core implementation. These implementations will be for programming languages only. While the newly subsumed generators will be immediately available if building from source, they will only be only available in the pip repository as a newer release of the SPaG package. Below is the status of all the generators currently tracked in the repository, linked to there implementation, and any applicable notes which may apply.

Status

Generator Status Notes
C DEVELOPING Direct scanner generator complete; Parser in progress
Go PLANNED
Python PLANNED

Script

A script is included in the package installation to allow easy generation of scanners and/or parsers (and possibly some combination of the two) into any number of output programming languages from the command line. It is a very light weight wrapper on the core functionality in the SPaG package. It also allows easy configuration of genration options and input specifications from command line or configuration file. Simply stated the CLI program drives the generation of the cross product of langauges x scanners x parser. Below shows various examples for the generator to produce scanner's and/or parser's for different token sets and LL(1) languages which may be found under the examples directory. Shown below are some basic invocation's for help, scanner, parser, and configuration file generation.

# Generate your scanner and/or parser! ...but first ask for help to see all
# available command line options.
$ spag_cli --help

# Generate any number of scanners (or parsers) for any set of languages at once!
$ spag_cli -s examples/INI/scanner.json examples/JSON/scanner.json -g c go
$ spag_cli -p examples/INI/parser.json examples/JSON/parser.json -g c go

# Generate a scanner/parser combo (possibly for a languages front-end reader).
$ spag_cli -s examples/INI/scanner.json -p examples/INI/parser.json -g c

# Generate a default configuration file for easier runtime configuration.
$ spag_cli --generate-rcfile .spagrc

# Control the generation [in/out]put with the specified configuration values.
$ spag_cli -c .spagrc

# NOTE: It is possible to override configuration file values with command line
# flag values if they are specified AFTER the configuration flag.
$ spag_cli -c .spagrc # Provide flag(s) for overwriting values here.

Development

Creating a new generator is extremely simple, as it only requires the addition of a single file. This file must contain a class definition following the proper naming conventions set forth for the new output language to be importable up by the package script. The below code block serves as a template for creating new generators which should be placed under spag/generators. The filename (denoted between curly brackets in the template below) should be named after the language being compiled to. Notice should also be taken between the capital and lowercase 'F'.

"""
A scanner/parser generator targeting {filename}.
"""
from spag import Generator


class {Filename}(Generator):
    """
    A simple object for compiling scanner's and/or parser's to {filename} programs.
    """

    def _translate(self):
        """
        Override this method in subclasses which should translate the (IR)
        intermediate representation of the given scanner and/or parser to the
        targeted language. It should return the files and there content in a
        dictionary allowing multiple files to be returned for a given language.
        This is the only required function a child generators must implement.

        Return:
          dict[str, str]: Generated file names/paths and associated content.
        """
        raise NotImplementedError('{filename} generator in development')

Once you have completed implementing the new generator it needs to be added to the list of supported generators. The tests should also be expanded to ensure the new generator can be properly imported as well. Additionally, the code must be linted, ensuring it follows the proper conventions set forth and therefore produces a similiar look and feel as the rest of the code in the repository. Linting, testing and installing can be automated through the use of the Makefile as shown below.

# Add the new generator to the __all__ list using your favorite editor.
$ vim spag/generators/__init__.py

# Ask the Makefile what it can do to help in linting, testing and installing.
$ make help

# Run sanity, to ensure no breakage occurs, which lints and tests the code
# against an installed version of the source in a virtual environment.
$ make sanity