peggy—PEG parsing in Python

5th of November, 2014

This morning, to scratch an itch1, I wrote a PEG implementation in Python.

I took a bit of an unusual approach to it, which I’ll explain here and I hope you may find interesting.

I am, I confess, a Haskell programmer on the sly. Haskell is quite interesting for giving one a bit of a different perspective on how code should look.

In particular, here, the design process for the imperative mindset is to ask what a parser does. I think what’s an interesting alternative approach is the declarative mindset, where we ask what a parser is.

To a first approximation, a parser is this (if you’ll forgive the Haskell syntax):

type Parser a = String -> Maybe (a, String)

Transliterating this to English: a parser which parses Strings to as is: a function which takes an a and returns either an a and the rest of the unparsed input, or nothing (in case of a parse error).

In Haskell, this takes the shape of a stack of monad transformers. Indeed, if we restate that type with some newtype wrappers:

type Parser = StateT String Maybe

We get most of the functionality we want “for free”. If we throw in the following extra utility function:

dot :: Parser Char
dot = do (x:xs) <- get
         put xs
         return x

That’s enough for the majority of basic parsing2.

Sequencing is achieved through <*> and >> which both come with the Applicative instance for StateT String Maybe. Ordered choice and empty come from <|> and empty in the Alternative instance for StateT String Maybe. In fact, the ?, + and * operators also come from the Alternative instance as optional, some and many.

I think this makes rather a nice example for Haskell: a pretty full set of parsing combinators in five lines of code, one of which is just type signature.

Python

I wanted to translate this formulation into Python to produce a relatively small and clean PEG implementation. The basic idea was a Parser wrapper class around these parser functions.

A Python parser function here is one that takes an input string and either:

The other combinators can all be built up from that basic building block.

The implementation is about 140 lines of Python.

  1. PyPEG, the current state-of-the-art Python PEG library, is full-featured but rather unwieldy in my opinion.

  2. This is not complete for PEGs as it is missing the & and !-predicates.