#lang scribble/manual
@(require scribble/eval
racket/date
file/md5
(for-label racket
brag/support
brag/lexer-support
brag/examples/nested-word-list
(only-in parser-tools/lex lexer-src-pos)
(only-in syntax/parse syntax-parse ~literal)))
@(define (lookup-date filename [default ""])
(cond
[(file-exists? filename)
(define modify-seconds (file-or-directory-modify-seconds filename))
(define a-date (seconds->date modify-seconds))
(date->string a-date)]
[else
default]))
@(define (compute-md5sum filename [default ""])
(cond [(file-exists? filename)
(bytes->string/utf-8 (call-with-input-file filename md5 #:mode 'binary))]
[else
default]))
@title{brag: the Beautiful Racket AST Generator}
@author["Danny Yoo (95%)" "Matthew Butterick (5%)"]
@defmodulelang[brag]
@section{Quick start}
@(define my-eval (make-base-eval))
@(my-eval '(require brag/examples/nested-word-list
racket/list
racket/match))
Suppose we're given the
following string:
@racketblock["(radiant (humble))"]
How would we turn this string into a structured value? That is, how would we @emph{parse} it? (Let's also suppose we've never heard of @racket[read].)
First, we need to consider the structure of the things we'd like to parse. The
string above looks like a nested list of words. Good start.
Second, how might we describe this formally — meaning, in a way that a computer could understand? A common notation to describe the structure of these things is @link["http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form"]{Backus-Naur Form} (BNF). So let's try to notate the structure of nested word lists in BNF.
@nested[#:style 'code-inset]{
@verbatim{
nested-word-list: WORD
| LEFT-PAREN nested-word-list* RIGHT-PAREN
}}
What we intend by this notation is this: @racket[nested-word-list] is either a @racket[WORD], or a parenthesized list of @racket[nested-word-list]s. We use the character @litchar{*} to represent zero or more repetitions of the previous thing. We treat the uppercased @racket[LEFT-PAREN], @racket[RIGHT-PAREN], and @racket[WORD] as placeholders for @emph{tokens} (a @tech{token} being the smallest meaningful item in the parsed string).

Here are a few examples of tokens:
@interaction[#:eval my-eval
(require brag/support)
(token 'LEFT-PAREN)
(token 'WORD "crunchy" #:span 7)
(token 'RIGHT-PAREN)]
This BNF description is also known as a @deftech{grammar}. Just as it does in a natural language like English or French, a grammar describes something in terms of what elements can fit where.
Have we made progress? We have a valid grammar. But we're still missing a @emph{parser}: a function that can use that description to make structures out of a sequence of tokens.
Meanwhile, it's clear that we don't yet have a valid program because there's no @litchar{#lang} line. Let's add one: put @litchar{#lang brag} at the top of the grammar, and save it as a file called @filepath{nested-word-list.rkt}.
@filebox["nested-word-list.rkt"]{
@verbatim{
#lang brag
nested-word-list: WORD
| LEFT-PAREN nested-word-list* RIGHT-PAREN
}}
Now it's a proper program. But what does it do?
@interaction[#:eval my-eval
@eval:alts[(require "nested-word-list.rkt") (void)]
parse
]
It gives us a @racket[parse] function. Let's investigate what @racket[parse]
does. What happens if we pass it a sequence of tokens?
@interaction[#:eval my-eval
(define a-parsed-value
(parse (list (token 'LEFT-PAREN "(")
(token 'WORD "some")
(token 'LEFT-PAREN "[")
(token 'WORD "pig")
(token 'RIGHT-PAREN "]")
(token 'RIGHT-PAREN ")"))))
a-parsed-value]
Those who have messed around with macros will recognize this as a @seclink["stx-obj" #:doc '(lib "scribblings/guide/guide.scrbl")]{syntax object}.
@interaction[#:eval my-eval
(syntax->datum a-parsed-value)
]
That's @racket[(some [pig])], essentially.
What happens if we pass our @racket[parse] function a bigger source of tokens?
@interaction[#:eval my-eval
@code:comment{tokenize: string -> (sequenceof token-struct?)}
@code:comment{Generate tokens from a string:}
(define (tokenize s)
(for/list ([str (regexp-match* #px"\\(|\\)|\\w+" s)])
(match str
["("
(token 'LEFT-PAREN str)]
[")"
(token 'RIGHT-PAREN str)]
[else
(token 'WORD str)])))
@code:comment{For example:}
(define token-source (tokenize "(welcome (to (((brag)) ())))"))
(define v (parse token-source))
(syntax->datum v)
]
Welcome to @tt{brag}.
@;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
@;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
@section{Introduction}
@tt{brag} is a parser generator designed to be easy to use:
@itemize[
@item{It provides a @litchar{#lang} for writing BNF grammars.
A module written in @litchar{#lang brag} automatically generates a
parser. The output of this parser tries to follow
@link["http://en.wikipedia.org/wiki/How_to_Design_Programs"]{HTDP}
guidelines. The structure of the grammar informs the structure of the
Racket syntax objects it generates.}

@item{The language uses a few conventions to simplify the expression of
grammars. The first rule in the grammar is assumed to be the
starting production. Identifiers in @tt{UPPERCASE} are treated as
terminal tokens. All other identifiers are treated as nonterminals.}

@item{Tokenizers can be developed independently of parsers.
@tt{brag} takes a liberal view on tokens: they can be strings,
symbols, or instances constructed with @racket[token]. Tokens can optionally provide source location, in which case a syntax object generated by the parser will too.}

@item{The parser can usually handle ambiguous grammars.}

@item{It integrates with the rest of the Racket
@link["http://docs.racket-lang.org/guide/languages.html"]{language toolchain}.}
]
@subsection{Example: a small DSL for ASCII diagrams}
@margin-note{This example is
@link["http://stackoverflow.com/questions/12345647/rewrite-this-script-by-designing-an-interpreter-in-racket"]{derived from a question} on Stack Overflow.}
To understand @tt{brag}'s design, let's look
at a toy problem. We'd like to define a language for
drawing simple ASCII diagrams. So if we write something like this:
@nested[#:style 'inset]{
@verbatim|{
3 9 X;
6 3 b 3 X 3 b;
3 9 X;
}|}
It should generate the following picture:
@nested[#:style 'inset]{
@verbatim|{
XXXXXXXXX
XXXXXXXXX
XXXXXXXXX
   XXX
   XXX
   XXX
   XXX
   XXX
   XXX
XXXXXXXXX
XXXXXXXXX
XXXXXXXXX
}|}
@subsection{Syntax and semantics}
We're being somewhat casual with what we mean by the program above. Let's try to nail down some meanings.
Each line of the program has a semicolon at the end, and describes the output of several @emph{rows} of the line drawing. Let's look at two of the lines in the example:
@itemize[
@item{@litchar{3 9 X;}: ``Repeat the following 3 times: print @racket["X"] nine times, followed by
a newline.''}
@item{@litchar{6 3 b 3 X 3 b;}: ``Repeat the following 6 times: print @racket[" "] three times,
followed by @racket["X"] three times, followed by @racket[" "] three times, followed by a newline.''}
]
Then each line consists of a @emph{repeat} number, followed by pairs of
(number, character) @emph{chunks}. We'll assume here that the lowercase character @litchar{b} represents the printing of a 1-character whitespace @racket[" "], and that uppercase letters represent the printing of themselves.
By understanding the pieces of each line, we can more easily capture that meaning in a grammar. Once we have each instruction of our ASCII DSL in a structured format, we should be able to parse it.
Here's a first pass at expressing the structure of these line-drawing programs.
@subsection{Parsing the concrete syntax}
@filebox["simple-line-drawing.rkt"]{
@verbatim|{
#lang brag
drawing: rows*
rows: repeat chunk+ ";"
repeat: INTEGER
chunk: INTEGER STRING
}|
}
@margin-note{@secref{brag-syntax} describes @tt{brag}'s syntax in more detail.}
We write a @tt{brag} program as a BNF grammar, where patterns can be:
@itemize[
@item{the names of other rules (e.g. @racket[chunk])}
@item{literal and symbolic token names (e.g. @racket[";"], @racket[INTEGER])}
@item{quantified patterns (e.g. @litchar{+} to represent one-or-more repetitions)}
]
The result of a @tt{brag} program is a module with a @racket[parse] function
that can parse tokens and produce a syntax object as a result.
Let's exercise this function:
@interaction[#:eval my-eval
(require brag/support)
@eval:alts[(require "simple-line-drawing.rkt")
(require brag/examples/simple-line-drawing)]
(define stx
(parse (list (token 'INTEGER 6)
(token 'INTEGER 2)
(token 'STRING " ")
(token 'INTEGER 3)
(token 'STRING "X")
";")))
(syntax->datum stx)
]
A @emph{token} is the smallest meaningful element of a source program. Tokens can be strings, symbols, or instances of the @racket[token] data structure. (Plus a few other special cases, which we'll discuss later.) Usually, a token holds a single character from the source program. But sometimes it makes sense to package a sequence of characters into a single token, if the sequence has an indivisible meaning.
If possible, we also want to attach source location information to each token. Why? Because this information will be incorporated into the syntax objects produced by @racket[parse].
A parser often works in conjunction with a helper function called a @emph{lexer} that converts the raw code of the source program into tokens. The @racketmodname[parser-tools/lex] library can help us write a position-sensitive
tokenizer:
@interaction[#:eval my-eval
(require parser-tools/lex)
(define (tokenize ip)
(port-count-lines! ip)
(define my-lexer
(lexer-src-pos
[(repetition 1 +inf.0 numeric)
(token 'INTEGER (string->number lexeme))]
[upper-case
(token 'STRING lexeme)]
["b"
(token 'STRING " ")]
[";"
(token ";" lexeme)]
[whitespace
(token 'WHITESPACE lexeme #:skip? #t)]
[(eof)
(void)]))
(define (next-token) (my-lexer ip))
next-token)
(define a-sample-input-port (open-input-string "6 2 b 3 X;"))
(define token-thunk (tokenize a-sample-input-port))
@code:comment{Now we can pass token-thunk to the parser:}
(define another-stx (parse token-thunk))
(syntax->datum another-stx)
@code:comment{The syntax object has location information:}
(syntax-line another-stx)
(syntax-column another-stx)
(syntax-span another-stx)
]
Note also from this lexer example:
@itemize[
@item{@racket[parse] accepts as input either a sequence of tokens, or a
function that produces tokens (which @racket[parse] will call repeatedly to get the next token).}
@item{As an alternative to the basic @racket[token] structure, a token can also be an instance of the @racket[position-token] structure (also found in @racketmodname[parser-tools/lex]). In that case, the token will try to derive its position from that of the position-token.}
@item{@racket[parse] will stop if it gets @racket[void] (or @racket['eof]) as a token.}
@item{@racket[parse] will skip any token that has its
@racket[#:skip?] attribute set to @racket[#t]. For instance, tokens representing comments often use @racket[#:skip?].}
]
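
As a rough illustration of the token-function and end-of-input points above, here is a minimal sketch of a token-producing function built from a plain list. The helper name @tt{make-token-thunk} is hypothetical (it is not part of @tt{brag}), and the sketch assumes the tokens are plain strings, which @tt{brag} accepts as tokens:

@codeblock|{
#lang racket/base

;; make-token-thunk: (listof any/c) -> (-> any/c)
;; Hypothetical helper: returns a zero-argument function that hands out
;; the given tokens one at a time, then returns (void) once they run out.
(define (make-token-thunk tokens)
  (define remaining tokens)
  (lambda ()
    (cond
      [(null? remaining) (void)]
      [else
       (define next (car remaining))
       (set! remaining (cdr remaining))
       next])))

(define next-token (make-token-thunk (list "(" "hello" ")")))
(next-token) ; "("
(next-token) ; "hello"
(next-token) ; ")"
(next-token) ; (void), which signals the end of the token source
}|

A function like this could be handed to @racket[parse] in place of a list, provided the grammar's terminals match the tokens it produces.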
@subsection{From parsing to interpretation}
We now have a parser for programs written in this simple-line-drawing language.
Our parser will return syntax objects:
@interaction[#:eval my-eval
(define parsed-program
(parse (tokenize (open-input-string "3 9 X; 6 3 b 3 X 3 b; 3 9 X;"))))
(syntax->datum parsed-program)
]
Better still, these syntax objects will have a predictable
structure that follows the grammar:
@racketblock[
(drawing (rows (repeat <number>)
(chunk <number> <string>) ... ";")
...)
]
where @racket[drawing], @racket[rows], @racket[repeat], and @racket[chunk]
should be treated literally, and everything else will be numbers or strings.
Still, these syntax-object values are just inert structures. How do we
interpret them, and make them @emph{print}? We claimed at the beginning of
this section that these syntax objects should be easy to interpret. So let's do it.
@margin-note{This is a very quick-and-dirty treatment of @racket[syntax-parse].
See the @racketmodname[syntax/parse] documentation for a gentler guide to its
features.} Racket provides a special form called @racket[syntax-parse] in the
@racketmodname[syntax/parse] library. @racket[syntax-parse] lets us do a
structural case-analysis on syntax objects: we provide it a set of patterns to
parse and actions to perform when those patterns match.
As a simple example, we can write a function that looks at a syntax object and
says @racket[#t] if it's the literal @racket[yes], and @racket[#f] otherwise:
@interaction[#:eval my-eval
(require syntax/parse)
@code:comment{yes-syntax-object?: syntax-object -> boolean}
@code:comment{Returns true if the syntax-object is yes.}
(define (yes-syntax-object? stx)
(syntax-parse stx
[(~literal yes)
#t]
[else
#f]))
(yes-syntax-object? #'yes)
(yes-syntax-object? #'nooooooooooo)
]
Here, we use @racket[~literal] to let @racket[syntax-parse] know that
@racket[yes] should show up literally in the syntax object. The patterns can
also have some structure to them, such as:
@racketblock[({~literal drawing} rows-stxs ...)]
which matches on syntax objects that begin, literally, with @racket[drawing],
followed by any number of rows (which are syntax objects too).
Now that we know a little bit more about @racket[syntax-parse],
we can use it to do a case analysis on the syntax
objects that our @racket[parse] function gives us.
We start by defining a function on syntax objects of the form @racket[(drawing
rows-stx ...)].
@interaction[#:eval my-eval
(define (interpret-drawing drawing-stx)
(syntax-parse drawing-stx
[({~literal drawing} rows-stxs ...)
(for ([rows-stx (syntax->list #'(rows-stxs ...))])
(interpret-rows rows-stx))]))]
When we encounter a syntax object of the form @racket[(drawing rows-stx ...)], we call @racket[interpret-rows] on each @racket[rows-stx].
@;The pattern we
@;express in @racket[syntax-parse] above marks what things should be treated
@;literally, and the @racket[...] is a a part of the pattern matching language
@;known by @racket[syntax-parse] that lets us match multiple instances of the
@;last pattern.
Let's define @racket[interpret-rows] now:
@interaction[#:eval my-eval
(define (interpret-rows rows-stx)
(syntax-parse rows-stx
[({~literal rows}
({~literal repeat} repeat-number)
chunks ... ";")
(for ([i (syntax-e #'repeat-number)])
(for ([chunk-stx (syntax->list #'(chunks ...))])
(interpret-chunk chunk-stx))
(newline))]))]
For a @racket[rows], we extract the @racket[repeat-number] from the
syntax object and use it as the range of the @racket[for] loop. The inner loop
walks across each @racket[chunk-stx] and calls @racket[interpret-chunk] on it.
Finally, we need to write a definition for @racket[interpret-chunk]. We want
it to extract the @racket[chunk-size] and @racket[chunk-string] portions,
and print to standard output:
@interaction[#:eval my-eval
(define (interpret-chunk chunk-stx)
(syntax-parse chunk-stx
[({~literal chunk} chunk-size chunk-string)
(for ([k (syntax-e #'chunk-size)])
(display (syntax-e #'chunk-string)))]))
]
@margin-note{Here are the definitions in a single file:
@link["examples/simple-line-drawing/interpret.rkt"]{interpret.rkt}.}
With these definitions in hand, we can now pass them syntax objects
that we construct directly by hand:
@interaction[#:eval my-eval
(interpret-chunk #'(chunk 3 "X"))
(interpret-drawing #'(drawing (rows (repeat 5) (chunk 3 "X") ";")))
]
or we can pass it the result generated by our parser:
@interaction[#:eval my-eval
(define parsed-program
(parse (tokenize (open-input-string "3 9 X; 6 3 b 3 X 3 b; 3 9 X;"))))
(interpret-drawing parsed-program)]
And now we've got an interpreter!
@subsection{From interpretation to compilation}
@margin-note{For a gentler tutorial on writing @litchar{#lang} extensions, see:
@link["http://hashcollision.org/brainfudge"]{F*dging up a Racket}.} (Just as a
warning: the following material is slightly more advanced, but shows how
writing a compiler for the line-drawing language reuses the ideas for the
interpreter.)
Wouldn't it be nice to be able to write something like:
@nested[#:style 'inset]{
@verbatim|{
3 9 X;
6 3 b 3 X 3 b;
3 9 X;
}|}
and have Racket automatically compile this down to something like this?
@racketblock[
(for ([i 3])
  (for ([k 9]) (display "X"))
  (newline))
(for ([i 6])
  (for ([k 3]) (display " "))
  (for ([k 3]) (display "X"))
  (for ([k 3]) (display " "))
  (newline))
(for ([i 3])
  (for ([k 9]) (display "X"))
  (newline))
]
Well, of course it won't work: we don't have a @litchar{#lang} line.
Let's add one.
@filebox["letter-i.rkt"]{
@verbatim|{
#lang brag/examples/simple-line-drawing
3 9 X;
6 3 b 3 X 3 b;
3 9 X;
}|
}
Now @filepath{letter-i.rkt} is a program.
How does this work? From the previous sections, we've seen how to take the
contents of a file and interpret it. What we want to do now is teach Racket
how to compile programs labeled with this @litchar{#lang} line. We'll do two
things:
@itemize[
@item{Tell Racket to use the @tt{brag}-generated parser and lexer we defined
earlier whenever it sees a program written with
@litchar{#lang brag/examples/simple-line-drawing}.}
@item{Define transformation rules for @racket[drawing], @racket[rows], and
@racket[chunk] to rewrite these into standard Racket forms.}
]
The second part, the writing of the transformation rules, will look very
similar to the definitions we wrote for the interpreter, but the transformation
will happen at compile-time. (We @emph{could} just resort to simply calling
into the interpreter we just wrote up, but this section is meant to show that
compilation is also viable.)
We do the first part by defining a @emph{module reader}: a
@link["http://docs.racket-lang.org/guide/syntax_module-reader.html"]{module
reader} tells Racket how to parse and compile a file. Whenever Racket sees a
@litchar{#lang <name>}, it looks for a corresponding module reader in
@filepath{<name>/lang/reader}.
Here's the definition for
@filepath{brag/examples/simple-line-drawing/lang/reader.rkt}:
@filebox["brag/examples/simple-line-drawing/lang/reader.rkt"]{
@codeblock|{
#lang s-exp syntax/module-reader
brag/examples/simple-line-drawing/semantics
#:read my-read
#:read-syntax my-read-syntax
#:whole-body-readers? #t
(require brag/examples/simple-line-drawing/lexer
brag/examples/simple-line-drawing/grammar)
(define (my-read in)
(syntax->datum (my-read-syntax #f in)))
(define (my-read-syntax src ip)
(list (parse src (tokenize ip))))
}|
}
We use a helper module @racketmodname[syntax/module-reader], which provides
utilities for creating a module reader. It uses the lexer and
@tt{brag}-generated parser we defined earlier, and also tells Racket that it should compile the forms in the syntax
object using a module called @filepath{semantics.rkt}.
@margin-note{For a systematic treatment on capturing the semantics of
a language, see @link["http://cs.brown.edu/~sk/Publications/Books/ProgLangs/"]{Programming Languages: Application and
Interpretation}.}
Let's look into @filepath{semantics.rkt} and see what's involved in
compilation:
@filebox["brag/examples/simple-line-drawing/semantics.rkt"]{
@codeblock|{
#lang racket/base
(require (for-syntax racket/base syntax/parse))
(provide #%module-begin
;; We reuse Racket's treatment of raw datums, specifically
;; for strings and numbers:
#%datum
;; And otherwise, we provide definitions of these three forms.
;; During compilation, Racket uses these definitions to
;; rewrite into for loops, displays, and newlines.
drawing rows chunk)
;; Define a few compile-time functions to do the syntax rewriting:
(begin-for-syntax
(define (compile-drawing drawing-stx)
(syntax-parse drawing-stx
[({~literal drawing} rows-stxs ...)
(syntax/loc drawing-stx
(begin rows-stxs ...))]))
(define (compile-rows rows-stx)
(syntax-parse rows-stx
[({~literal rows}
({~literal repeat} repeat-number)
chunks ...
";")
(syntax/loc rows-stx
(for ([i repeat-number])
chunks ...
(newline)))]))
(define (compile-chunk chunk-stx)
(syntax-parse chunk-stx
[({~literal chunk} chunk-size chunk-string)
(syntax/loc chunk-stx
(for ([k chunk-size])
(display chunk-string)))])))
;; Wire up the use of "drawing", "rows", and "chunk" to these
;; transformers:
(define-syntax drawing compile-drawing)
(define-syntax rows compile-rows)
(define-syntax chunk compile-chunk)
}|
}
The semantics hold definitions for @racket[compile-drawing],
@racket[compile-rows], and @racket[compile-chunk], similar to what we had for
interpretation with @racket[interpret-drawing], @racket[interpret-rows], and
@racket[interpret-chunk]. However, compilation is not the same as
interpretation: each definition does not immediately execute the act of
drawing, but rather returns a syntax object whose evaluation will do the actual
work.
There are a few things to note:
@itemize[
@item{@tt{brag}'s native data structure is the syntax object because the
majority of Racket's language-processing infrastructure knows how to read and
write this structured value.}
@item{
@margin-note{By the way, we can just as easily rewrite the semantics so that
@racket[compile-rows] does explicitly call @racket[compile-chunk]. Often,
though, it's easier to write the transformation functions in this piecemeal way
and depend on the Racket macro expansion system to do the rewriting as it
encounters each of the forms.}
Unlike in interpretation, @racket[compile-rows] doesn't
compile each chunk by directly calling @racket[compile-chunk]. Rather, it
depends on the Racket macro expander to call each @racket[compile-XXX] function
as it encounters a @racket[drawing], @racket[rows], or @racket[chunk] in the
parsed value. The three statements at the bottom of @filepath{semantics.rkt} inform
the macro expansion system to do this:
@racketblock[
(define-syntax drawing compile-drawing)
(define-syntax rows compile-rows)
(define-syntax chunk compile-chunk)
]}
]
Altogether, @tt{brag}'s intent is to be a parser generator for Racket
that's easy and fun to use. It's meant to fit naturally with the other tools
in the Racket language toolchain. Hopefully, it will reduce the friction in
making new languages with alternative concrete syntaxes.
The rest of this document describes the @tt{brag} language and the parsers it
generates.
@;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
@;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
@section{The language}
@subsection[#:tag "brag-syntax"]{Syntax and terminology}
A program in the @tt{brag} language consists of the language line
@litchar{#lang brag}, followed by a collection of @tech{rule}s and
@tech{line comment}s.
A @deftech{rule} is a sequence consisting of: a @tech{rule identifier}, a colon
@litchar{":"}, and a @tech{pattern}.
A @deftech{rule identifier} is an @tech{identifier} that is not in upper case.
A @deftech{token identifier} is an @tech{identifier} that is in upper case.
An @deftech{identifier} is a character sequence of letters, numbers, and
characters in @racket["-.!$%&/<=>?^_~@"]. It must not contain
@litchar{*} or @litchar{+}, as those characters are used to denote
quantification.
A @deftech{pattern} is one of the following:
@itemize[
@item{an implicit sequence of @tech{pattern}s separated by whitespace}
@item{a terminal: either a literal string or a @tech{token identifier}}
@item{a @tech{rule identifier}}
@item{a @deftech{choice pattern}: a sequence of @tech{pattern}s delimited with @litchar{|} characters.}
@item{a @deftech[#:key "quantifed pattern"]{quantified pattern}: a @tech{pattern} followed by either @litchar{*} (``zero or more'') or @litchar{+} (``one or more'')}
@item{an @deftech{optional pattern}: a @tech{pattern} surrounded by @litchar{[} and @litchar{]}}
@item{an explicit sequence: a @tech{pattern} surrounded by @litchar{(} and @litchar{)}}]
A @deftech{line comment} begins with either @litchar{#} or @litchar{;} and
continues till the end of the line.
For example, in the following program:
@nested[#:style 'inset
@verbatim|{
#lang brag
;; A parser for a silly language
sentence: verb optional-adjective object
verb: greeting
optional-adjective: ["happy" | "frumpy"]
greeting: "hello" | "hola" | "aloha"
object: "world" | WORLD
}|]
the elements @tt{sentence}, @tt{verb}, @tt{optional-adjective}, @tt{greeting}, and @tt{object} are rule
identifiers. The first rule, @litchar{sentence: verb optional-adjective object},
is a rule whose right side is an implicit pattern sequence of three
sub-patterns. The uppercased @tt{WORLD} is a token identifier. The fourth rule in the program associates @tt{greeting} with a @tech{choice pattern}.
More examples:
@itemize[
@item{A BNF for binary strings that contain an equal number of zeros and ones.
@verbatim|{
#lang brag
equal: [zero one | one zero] ;; equal number of "0"s and "1"s.
zero: "0" equal | equal "0" ;; has an extra "0" in it.
one: "1" equal | equal "1" ;; has an extra "1" in it.
}|
}
@item{A BNF for
@link["http://www.json.org/"]{JSON}-like structures.
@verbatim|{
#lang brag
json: number | string
| array | object
number: NUMBER
string: STRING
array: "[" [json ("," json)*] "]"
object: "{" [kvpair ("," kvpair)*] "}"
kvpair: ID ":" json
}|
}
]
@subsection{Syntax errors}
Besides the basic syntax errors that can occur with a malformed grammar, there
are a few other classes of situations that @litchar{#lang brag} will consider
as syntax errors.
@tt{brag} will raise a syntax error if the grammar:
@itemize[
@item{doesn't have any rules.}
@item{has a rule with the same left hand side as any other rule.}
@item{refers to rules that have not been defined. For example, the
following program:
@nested[#:style 'code-inset
@verbatim|{
#lang brag
foo: [bar]
}|
]
should raise an error because @tt{bar} has not been defined, even though
@tt{foo} refers to it in an @tech{optional pattern}.}
@item{uses the token name @racket[EOF]; the end-of-file token type is reserved
for internal use by @tt{brag}.}
@item{contains a rule that has no finite derivation. For example, the following
program:
@nested[#:style 'code-inset
@verbatim|{
#lang brag
infinite-a: "a" infinite-a
}|
]
should raise an error because no finite sequence of tokens will satisfy
@tt{infinite-a}.}
]
Otherwise, @tt{brag} should be fairly tolerant and permit even ambiguous
grammars.
@subsection{Semantics}
@declare-exporting[brag/examples/nested-word-list]
A program written in @litchar{#lang brag} produces a module that provides a few
bindings. The most important of these is @racket[parse]:
@defproc[(parse [source any/c #f]
[token-source (or/c (sequenceof token)
(-> token))])
syntax?]{
Parses the sequence of @tech{tokens} according to the rules in the grammar, using the
first rule as the start production. The parse must completely consume
@racket[token-source].
The @deftech{token source} can either be a sequence, or a 0-arity function that
produces @tech{tokens}.
A @deftech{token} in @tt{brag} can be any of the following values:
@itemize[
@item{a string}
@item{a symbol}
@item{an instance produced by @racket[token]}
@item{an instance produced by the token constructors of @racketmodname[parser-tools/lex]}
@item{an instance of @racketmodname[parser-tools/lex]'s @racket[position-token] whose
@racket[position-token-token] is a @tech{token}.}
]
A token whose type is either @racket[void] or @racket['EOF] terminates the
source.
If @racket[parse] succeeds, it will return a structured syntax object. The
structure of the syntax object follows the overall structure of the rules in
the BNF grammar. For each rule @racket[r] and its associated pattern @racket[p],
@racket[parse] generates a syntax object @racket[#'(r p-value)] where
@racket[p-value]'s structure follows a case analysis on @racket[p]:
@itemize[
@item{For implicit and explicit sequences of @tech{pattern}s @racket[p1],
@racket[p2], ..., the corresponding values, spliced into the
structure.}
@item{For terminals, the value associated with the token.}
@item{For @tech{rule identifier}s: the associated parse value for the rule.}
@item{For @tech{choice pattern}s: the associated parse value for one of the matching subpatterns.}
@item{For @tech[#:key "quantifed pattern"]{quantified pattern}s and @tech{optional pattern}s: the corresponding values, spliced into the structure.}
]
Consequently, it's only the presence of @tech{rule identifier}s in a rule's
pattern that tells the parser to introduce nested structure into the syntax
object.
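
To make the splicing rule concrete, here is a small sketch that reuses the line-drawing grammar from the introduction. It assumes the @tt{brag/examples/simple-line-drawing} module required in the earlier interaction is installed, and the expected shape in the final comment follows the structure described in that section rather than a recorded run:

@codeblock|{
#lang racket/base
(require brag/support
         brag/examples/simple-line-drawing)

;; The rule in question is:  rows: repeat chunk+ ";"
;; `repeat` and `chunk` are rule identifiers, so each contributes a
;; nested list. The quantifier `+` adds no nesting of its own: both
;; chunks should end up as direct children of `rows`.
(define stx
  (parse (list (token 'INTEGER 2)
               (token 'INTEGER 1) (token 'STRING "X")
               (token 'INTEGER 1) (token 'STRING "Y")
               ";")))

(syntax->datum stx)
;; expected shape, following the grammar:
;; (drawing (rows (repeat 2) (chunk 1 "X") (chunk 1 "Y") ";"))
}|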
If the grammar has ambiguity, @tt{brag} will choose and return a parse, though
it does not guarantee which one it chooses.
If the parse cannot be performed successfully, or if a token in the
@racket[token-source] uses a type that isn't mentioned in the grammar, then
@racket[parse] raises an instance of @racket[exn:fail:parsing].}
@defproc[(parse-tree [source any/c #f]
[token-source (or/c (sequenceof token)
(-> token))])
list?]{
Same as @racket[parse], but the result is converted into a visible parse tree. Useful for testing or debugging a parser.
}
@defform[#:id make-rule-parser
(make-rule-parser name)]{
Constructs a parser for the @racket[name] of one of the non-terminals
in the grammar.
For example, given the @tt{brag} program
@filepath{simple-arithmetic-grammar.rkt}:
@filebox["simple-arithmetic-grammar.rkt"]{
@verbatim|{
#lang brag
expr : term ('+' term)*
term : factor ('*' factor)*
factor : INT
}|
}
the following interaction shows how to extract a parser for @racket[term]s.
@interaction[#:eval my-eval
@eval:alts[(require "simple-arithmetic-grammar.rkt")
(require brag/examples/simple-arithmetic-grammar)]
(define term-parse (make-rule-parser term))
(define tokens (list (token 'INT 3)
"*"
(token 'INT 4)))
(syntax->datum (parse tokens))
(syntax->datum (term-parse tokens))
(define another-token-sequence
(list (token 'INT 1) "+" (token 'INT 2)
"*" (token 'INT 3)))
(syntax->datum (parse another-token-sequence))
@code:comment{Note that term-parse will break on another-token-sequence}
@code:comment{as it does not know what to do with the "+"}
(term-parse another-token-sequence)
]
}
@defthing[all-token-types (setof symbol?)]{
A set of all the token types used in a grammar.
For example:
@interaction[#:eval my-eval
@eval:alts[(require "simple-arithmetic-grammar.rkt")
(require brag/examples/simple-arithmetic-grammar)]
all-token-types
]
}
@section{Support API}
@defmodule[brag/support]
The @racketmodname[brag/support] module provides functions to interact with
@tt{brag} programs. The most useful is the @racket[token] function, which
produces tokens to be parsed.
@defproc[(token [type (or/c string? symbol?)]
[val any/c #f]
[#:line line (or/c positive-integer? #f) #f]
[#:column column (or/c natural-number? #f) #f]
[#:offset offset (or/c positive-integer? #f) #f]
[#:span span (or/c natural-number? #f) #f]
[#:skip? skip? boolean? #f]
)
token-struct?]{
Creates an instance of @racket[token-struct].
In the syntax object produced by a parse, the value @racket[val] appears in
place of the token name used in the grammar.
If @racket[#:skip?] is true, the parser skips over the token during a
parse.}
@defstruct[token-struct ([type symbol?]
[val any/c]
[offset (or/c positive-integer? #f)]
                         [line (or/c positive-integer? #f)]
                         [column (or/c natural-number? #f)]
[span (or/c natural-number? #f)]
[skip? boolean?])
#:transparent]{
The token structure type.
Rather than directly using the @racket[token-struct] constructor, please use
the helper function @racket[token] to construct instances.
}
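A short sketch of building tokens with @racket[token] and reading their fields back. The accessor names (@racket[token-struct-type], @racket[token-struct-val], @racket[token-struct-skip?]) are assumed to follow from the structure definition above; the token types @racket['INT] and @racket['WHITESPACE] are only illustrative.

@codeblock|{
#lang racket
(require brag/support)

;; A token carrying a value plus source-location information.
(define int-tok
  (token 'INT 42 #:line 1 #:column 0 #:offset 1 #:span 2))

;; A token the parser should ignore, such as whitespace.
(define ws-tok
  (token 'WHITESPACE " " #:skip? #t))

(token-struct-type int-tok)   ; the token type
(token-struct-val int-tok)    ; the value that ends up in the parse result
(token-struct-skip? ws-tok)   ; whether the parser skips this token
}|

Attaching source locations is optional, but it lets error messages and the resulting syntax objects point back into the original text.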
@defstruct[(exn:fail:parsing exn:fail)
([message string?]
[continuation-marks continuation-mark-set?]
[srclocs (listof srcloc?)])]{
The exception raised when parsing fails.
@racket[exn:fail:parsing] implements Racket's @racket[prop:exn:srcloc]
property, so if this exception reaches DrRacket's default error handler,
DrRacket should highlight the offending locations in the source.}
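Outside DrRacket, a program can catch the exception itself. The following is a minimal sketch, assuming the bundled @racketmodname[brag/examples/simple-arithmetic-grammar] module and a deliberately incomplete token sequence; @racket[bad-tokens] and @racket[result] are illustrative names.

@codeblock|{
#lang racket
(require brag/support
         brag/examples/simple-arithmetic-grammar)

;; "1 +" is incomplete: the grammar requires a term after "+".
(define bad-tokens (list (token 'INT 1) "+"))

(define result
  (with-handlers ([exn:fail:parsing?
                   (λ (e)
                     ;; Report the message, then return the source locations.
                     (eprintf "parse failed: ~a\n" (exn-message e))
                     (exn:fail:parsing-srclocs e))])
    (parse bad-tokens)))

result
}|

Here @racket[result] holds the list of source locations from the exception rather than a syntax object, which is one way to report errors in a custom front end.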
@section{Lexer support API}
@defmodule[brag/lexer-support]
In addition to the exports shown below, the @racketmodname[brag/lexer-support] module also provides everything from @racketmodname[brag/support], and everything from @racketmodname[parser-tools/lex].
@defproc[(apply-tokenizer [tokenizer procedure?]
[source-string (or/c string?
input-port?)])
list?]{
Repeatedly apply @racket[tokenizer] to @racket[source-string], gathering the resulting tokens into a list. Useful for testing or debugging a tokenizer.
}
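The sketch below shows one way @racket[apply-tokenizer] might be used. It rests on an assumption that is not stated above: that @racket[tokenizer] takes an input port and returns a thunk producing the next token, with @racket[eof] marking the end of input. The helper @racket[make-tokenizer] and the token types are hypothetical.

@codeblock|{
#lang racket
(require brag/lexer-support)

;; ASSUMPTION: a tokenizer maps an input port to a thunk that yields
;; one token per call and eof when the input is exhausted.
(define (make-tokenizer port)
  (define lex
    (lexer
     [(:+ numeric) (token 'INT (string->number lexeme))]
     [(:or "+" "*") lexeme]
     [(:+ whitespace) (token 'WS lexeme #:skip? #t)]
     [(eof) eof]))
  (λ () (lex port)))

;; Gather every token from the string into a list for inspection.
(define toks (apply-tokenizer make-tokenizer "3 * 4"))
toks
}|

Inspecting the token list this way makes it possible to debug the tokenizer separately from the grammar.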
@defproc[(trim-delimiters [left-delimiter string?]
[str string?]
[right-delimiter string?])
string?]{
Remove @racket[left-delimiter] from the left side of @racket[str], and @racket[right-delimiter] from its right side. Intended as a helper function for @racket[delimited-by].
}
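For instance, a minimal sketch with a parenthesized string (the name @racket[trimmed] is illustrative):

@codeblock|{
#lang racket
(require brag/lexer-support)

;; Strip one "(" from the left and one ")" from the right.
(define trimmed (trim-delimiters "(" "(hello)" ")"))
trimmed
}|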
@defform[(:* re ...)]{
Repetition of @racket[re] sequence 0 or more times.}
@defform[(:+ re ...)]{
Repetition of @racket[re] sequence 1 or more times.}
@defform[(:? re ...)]{
Zero or one occurrence of @racket[re] sequence.}
@defform[(:= n re ...)]{
Exactly @racket[n] occurrences of @racket[re] sequence, where
@racket[n] must be a literal exact, non-negative number.}
@defform[(:>= n re ...)]{
At least @racket[n] occurrences of @racket[re] sequence, where
@racket[n] must be a literal exact, non-negative number.}
@defform[(:** n m re ...)]{
Between @racket[n] and @racket[m] (inclusive) occurrences of
@racket[re] sequence, where @racket[n] must be a literal exact,
non-negative number, and @racket[m] must be literally either
@racket[#f], @racket[+inf.0], or an exact, non-negative number; a
@racket[#f] value for @racket[m] is the same as @racket[+inf.0].}
@defform[(:or re ...)]{
Same as @racket[(union re ...)].}
@deftogether[(
@defform[(:: re ...)]
@defform[(:seq re ...)]
)]{
Both forms concatenate the @racket[re]s.}
@defform[(:& re ...)]{
Intersects the @racket[re]s.}
@defform[(:- re ...)]{
The set difference of the @racket[re]s.}
@defform[(:~ re ...)]{
Character-set complement, where each @racket[re] must match exactly
one character.}
@defform[(:/ char-or-string ...)]{
Character ranges, matching characters between successive pairs of
characters.}
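The forms above are meant to be combined inside a @racket[lexer]. The following sketch assumes that @racket[lexer], @racket[lexeme], and @racket[whitespace] are available through @racketmodname[brag/lexer-support], as the re-export of @racketmodname[parser-tools/lex] suggests; @racket[lex-one] and the token types are hypothetical.

@codeblock|{
#lang racket
(require brag/lexer-support)

(define lex-one
  (lexer
   ;; One or more digits.
   [(:+ (:/ "0" "9")) (token 'INT (string->number lexeme))]
   ;; A letter followed by zero or more letters or digits.
   [(:: (:/ "a" "z" "A" "Z")
        (:* (:or (:/ "a" "z" "A" "Z") (:/ "0" "9"))))
    (token 'ID lexeme)]
   ;; Runs of whitespace, marked so the parser skips them.
   [(:+ whitespace) (token 'WS lexeme #:skip? #t)]
   [(eof) eof]))

(define in (open-input-string "x1 42"))
(define first-tok (lex-one in))   ; identifier
(define second-tok (lex-one in))  ; whitespace
(define third-tok (lex-one in))   ; integer
}|

Each call to @racket[lex-one] consumes the longest match at the current port position, so successive calls walk through the input one token at a time.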
@defform[(delimited-by open close)]{
A string that is bounded by the @racket[open] and @racket[close] delimiters. Matching is non-greedy (meaning, it stops at the first occurrence of @racket[close]). The resulting lexeme includes the delimiters. To remove them, see @racket[trim-delimiters].}
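A small sketch of the non-greedy behavior, assuming @racket[delimited-by] can be used as a pattern inside a @racket[lexer]; @racket[comment-lexer] is a hypothetical name.

@codeblock|{
#lang racket
(require brag/lexer-support)

(define comment-lexer
  (lexer
   ;; Match a C-style comment and return the raw lexeme.
   [(delimited-by "/*" "*/") lexeme]
   [(eof) eof]))

;; The input contains two closing delimiters; matching should stop
;; at the first one, and the lexeme should keep both delimiters.
(define in (open-input-string "/* one */ two */"))
(define first-lexeme (comment-lexer in))
first-lexeme
}|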
@close-eval[my-eval]