How would we turn this string into a structured value? That is, how would we @emph{parse} it? (Let's also suppose we've never heard of @racket[read].)
First, we need to consider the structure of the things we'd like to parse. The
string above looks like a nested list of words. Good start.
Second, how might we describe this formally —meaning, in a way that a computer could understand? A common notation to describe the structure of these things is @link["http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form"]{Backus-Naur Form} (BNF). So let's try to notate the structure of nested word lists in BNF.
@nested[#:style 'code-inset]{
@verbatim{
nested-word-list: WORD
| LEFT-PAREN nested-word-list* RIGHT-PAREN
}}
What we intend by this notation is this: @racket[nested-word-list] is either a @racket[WORD], or a parenthesized list of @racket[nested-word-list]s. We use the character @litchar{*} to represent zero or more repetitions of the previous thing. We treat the uppercased @racket[LEFT-PAREN], @racket[RIGHT-PAREN], and @racket[WORD] as placeholders for @emph{tokens} (a @tech{token} being the smallest meaningful item in the parsed string):
Here are a few examples of tokens:
@interaction[#:eval my-eval
(require brag/support)
(token 'LEFT-PAREN)
(token 'WORD "crunchy" #:span 7)
(token 'RIGHT-PAREN)]
This BNF description is also known as a @deftech{grammar}. Just as it does in a natural language like English or French, a grammar describes something in terms of what elements can fit where.
Have we made progress? We have a valid grammar. But we're still missing a @emph{parser}: a function that can use that description to make structures out of a sequence of tokens.
Meanwhile, it's clear that we don't yet have a valid program because there's no @litchar{#lang} line. Let's add one: put @litchar{#lang brag} at the top of the grammar, and save it as a file called @filepath{nested-word-list.rkt}.
@item{The language uses a few conventions to simplify the expression of
grammars. The first rule in the grammar is assumed to be the
starting production. Identifiers in @tt{UPPERCASE} are treated as
terminal tokens. All other identifiers are treated as nonterminals.}
@item{Tokenizers can be developed independently of parsers.
@tt{brag} takes a liberal view on tokens: they can be strings,
symbols, or instances constructed with @racket[token]. Tokens can optionally provide source location, in which case a syntax object generated by the parser will too.}
@item{The parser can usually handle ambiguous grammars.}
@subsection{Example: a small DSL for ASCII diagrams}
Suppose we'd like to define a language for
drawing simple ASCII diagrams. So if we write something like this:
@nested[#:style 'inset]{
@verbatim|{
3 9 X;
6 3 b 3 X 3 b;
3 9 X;
}|}
It should generate the following picture:
@nested[#:style 'inset]{
@verbatim|{
XXXXXXXXX
XXXXXXXXX
XXXXXXXXX
XXX
XXX
XXX
XXX
XXX
XXX
XXXXXXXXX
XXXXXXXXX
XXXXXXXXX
}|}
This makes sense in a casual way. But let's be more precise about how the language works.
Each line of the program has a semicolon at the end, and describes the output of several @emph{rows} of the line drawing. Let's look at two of the lines in the example:
@itemize[
@item{@litchar{3 9 X;}: ``Repeat the following 3 times: print @racket["X"] nine times, followed by
a newline.''}
@item{@litchar{6 3 b 3 X 3 b;}: ``Repeat the following 6 times: print @racket[" "] three times,
followed by @racket["X"] three times, followed by @racket[" "] three times, followed by a newline.''}
]
Then each line consists of a @emph{repeat} number, followed by pairs of
(number, character) @emph{chunks}. We'll assume here that the intent of the lowercased character @litchar{b} is to represent the printing of a 1-character whitespace @racket[" "], and for other uppercase letters to represent the printing of themselves.
By understanding the pieces of each line, we can more easily capture that meaning in a grammar. Once we have each instruction of our ASCII DSL in a structured format, we should be able to parse it.
Here's a first pass at expressing the structure of these line-drawing programs.
@subsection{Parsing the concrete syntax}
@filebox["simple-line-drawing.rkt"]{
@verbatim|{
#lang brag
drawing: rows*
rows: repeat chunk+ ";"
repeat: INTEGER
chunk: INTEGER STRING
}|
}
@margin-note{@secref{brag-syntax} describes @tt{brag}'s syntax in more detail.}
We write a @tt{brag} program as an BNF grammar, where patterns can be:
@itemize[
@item{the names of other rules (e.g. @racket[chunk])}
@item{literal and symbolic token names (e.g. @racket[";"], @racket[INTEGER])}
@item{quantified patterns (e.g. @litchar{+} to represent one-or-more repetitions)}
]
The result of a @tt{brag} program is a module with a @racket[parse] function
that can parse tokens and produce a syntax object as a result.
Let's try this function:
@interaction[#:eval my-eval
(require brag/support)
@eval:alts[(require "simple-line-drawing.rkt")
(require brag/examples/simple-line-drawing)]
(define stx
(parse (list (token 'INTEGER 6)
(token 'INTEGER 2)
(token 'STRING " ")
(token 'INTEGER 3)
(token 'STRING "X")
";")))
(syntax->datum stx)
]
A @emph{token} is the smallest meaningful element of a source program. Tokens can be strings, symbols, or instances of the @racket[token] data structure. (Plus a few other special cases, which we'll discuss later.) Usually, a token holds a single character from the source program. But sometimes it makes sense to package a sequence of characters into a single token, if the sequence has an indivisible meaning.
If possible, we also want to attach source location information to each token. Why? Because this informatino will be incorporated into the syntax objects produced by @racket[parse].
A parser often works in conjunction with a helper function called a @emph{lexer} that converts the raw code of the source program into tokens. The @racketmodname[br-parser-tools/lex] library can help us write a position-sensitive
tokenizer:
@interaction[#:eval my-eval
(require br-parser-tools/lex)
(define (tokenize ip)
(port-count-lines! ip)
(define my-lexer
(lexer-src-pos
[(repetition 1 +inf.0 numeric)
(token 'INTEGER (string->number lexeme))]
[upper-case
(token 'STRING lexeme)]
["b"
(token 'STRING " ")]
[";"
(token ";" lexeme)]
[whitespace
(token 'WHITESPACE lexeme #:skip? #t)]
[(eof)
(void)]))
(define (next-token) (my-lexer ip))
next-token)
(define a-sample-input-port (open-input-string "6 2 b 3 X;"))
@code:comment{Now we can pass token-thunk to the parser:}
(define another-stx (parse token-thunk))
(syntax->datum another-stx)
@code:comment{The syntax object has location information:}
(syntax-line another-stx)
(syntax-column another-stx)
(syntax-span another-stx)
]
Note also from this lexer example:
@itemize[
@item{@racket[parse] accepts as input either a sequence of tokens, or a
function that produces tokens (which @racket[parse] will call repeatedly to get the next token).}
@item{As an alternative to the basic @racket[token] structure, a token can also be an instance of the @racket[position-token] structure (also found in @racketmodname[br-parser-tools/lex]). In that case, the token will try to derive its position from that of the position-token.}
@item{@racket[parse] will stop if it gets @racket[void] (or @racket['eof]) as a token.}
@item{@racket[parse] will skip any token that has
@racket[#:skip?] attribute set to @racket[#t]. For instance, tokens representing comments often use @racket[#:skip?].}
]
@subsection{From parsing to interpretation}
We now have a parser for programs written in this simple-line-drawing language.
Our parser will return syntax objects:
@interaction[#:eval my-eval
(define parsed-program
(parse (tokenize (open-input-string "3 9 X; 6 3 b 3 X 3 b; 3 9 X;"))))
(syntax->datum parsed-program)
]
Better still, these syntax objects will have a predictable
structure that follows the grammar:
@racketblock[
(drawing (rows (repeat <number>)
(chunk <number> <string>) ... ";")
...)
]
where @racket[drawing], @racket[rows], @racket[repeat], and @racket[chunk]
should be treated literally, and everything else will be numbers or strings.
Still, these syntax-object values are just inert structures. How do we
interpret them, and make them @emph{print}? We claimed at the beginning of
this section that these syntax objects should be easy to interpret. So let's do it.
@margin-note{This is a very quick-and-dirty treatment of @racket[syntax-parse].
See the @racketmodname[syntax/parse] documentation for a gentler guide to its
features.} Racket provides a special form called @racket[syntax-parse] in the
@racketmodname[syntax/parse] library. @racket[syntax-parse] lets us do a
structural case-analysis on syntax objects: we provide it a set of patterns to
parse and actions to perform when those patterns match.
As a simple example, we can write a function that looks at a syntax object and
says @racket[#t] if it's the literal @racket[yes], and @racket[#f] otherwise:
@subsection[#:tag "brag-syntax"]{Syntax and terminology}
A program in the @tt{brag} language consists of the language line
@litchar{#lang brag}, followed by a collection of @tech{rule}s and
@tech{line comment}s.
A @deftech{rule} is a sequence consisting of: a @tech{rule identifier}, a colon
@litchar{":"}, and a @tech{pattern}.
A @deftech{rule identifier} is an @tech{identifier} that is not in upper case.
A @deftech{symbolic token identifier} is an @tech{identifier} that is in upper case.
An @deftech{identifier} is a character sequence of letters, numbers, and
characters in @racket["-.!$%&/<=>?^_~@"]. It must not contain
@litchar{*} or @litchar{+}, as those characters are used to denote
quantification.
A @deftech{pattern} is one of the following:
@itemize[
@item{an implicit sequence of @tech{pattern}s separated by whitespace}
@item{a terminal: either a literal string or a @tech{symbolic token identifier}.
When used in a pattern, both these terminals will match the same set of inputs. A literal string can match the string itself, or a @racket[token] whose type field contains that string (or its symbol form). So @racket["FOO"] would match @racket["FOO"], @racket[(token "FOO" "bar")], or @racket[(token 'FOO "bar")]. A symbolic token identifier can also match the string version of the identifier, or a @racket[token] whose type field is the symbol or string form of the identifier. So @racket[FOO] would also match @racket["FOO"], @racket[(token 'FOO "bar")], or @racket[(token "FOO" "bar")]. (In every case, the value of a token, like @racket["bar"], can be anything, and may or may not be the same as its type.)
Because their underlying meanings are the same, the symbolic token identifier ends up being a notational convenience for readability inside a grammar pattern. Typically, the literal string @racket["FOO"] is used to connote ``match the string @racket["FOO"] exactly'' and the symbolic token identifier @racket[FOO] specially connotes ``match any token of type @racket['FOO]''.}
@item{a @tech{rule identifier}}
@item{a @deftech{choice pattern}: a sequence of @tech{pattern}s delimited with @litchar{|} characters.}
@item{a @deftech{quantifed pattern}: a @tech{pattern} followed by either @litchar{*} (``zero or more'') or @litchar{+} (``one or more'')}
@item{an @deftech{optional pattern}: a @tech{pattern} surrounded by @litchar{[} and @litchar{]}}
@item{an explicit sequence: a @tech{pattern} surrounded by @litchar{(} and @litchar{)}}]
A @deftech{line comment} begins with either @litchar{#} or @litchar{;} and
continues till the end of the line.
For example, in the following program:
@nested[#:style 'inset
@verbatim|{
#lang brag
;; A parser for a silly language
sentence: verb optional-adjective object
verb: greeting
optional-adjective: ["happy" | "frumpy"]
greeting: "hello" | "hola" | "aloha"
object: "world" | WORLD
}|]
the elements @tt{sentence}, @tt{verb}, @tt{greeting}, and @tt{object} are rule
identifiers. The first rule, @litchar{sentence: verb optional-adjective
object}, is a rule whose right side is an implicit pattern sequence of three
sub-patterns. The uppercased @tt{WORLD} is a symbolic token identifier. The fourth rule in the program associates @tt{greeting} with a @tech{choice pattern}.
More examples:
@itemize[
@item{A
BNF for binary
strings that contain an equal number of zeros and ones.
@verbatim|{
#lang brag
equal: [zero one | one zero] ;; equal number of "0"s and "1"s.
zero: "0" equal | equal "0" ;; has an extra "0" in it.
one: "1" equal | equal "1" ;; has an extra "1" in it.
By default, every matched token shows up in the parse tree. But sometimes that means that the parse tree ends up holding a bunch of tokens that were only needed to complete the parsing. Once they've served their purpose, it's sometimes useful to filter them out (for instance, to simplify the implementation of a language expander). To help with this kind of housekeeping, @racket[brag] supports @emph{cuts} and @emph{splices}.
A @deftech{cut} in a grammar will delete an item from the parse tree. A cut is notated by prefixing either the left-hand rule name or a right-hand pattern element with a slash@litchar{/}.
If the cut is applied to a left-hand rule name, the rule name is omitted from the parse tree, but its node and its matched elements remain.
If the cut is applied to a right-hand pattern element, then that element is omitted from every node matching that rule.
For instance, consider this simple grammar for arithmetic expressions:
Suppose we felt the @litchar{+} and @litchar{*} characters were superfluous. We can add cuts to the grammar by prefixing these pattern elements with @litchar{/}:
A @deftech{splice} in a grammar will merge the elements of a node into the surrounding node. A splice is notated by prefixing either the left-hand rule name or a right-hand pattern element with an at sign@litchar|{@}|.
If the splice is applied to a left-hand rule name, then the splice is applied every time the rule is used in the parse tree.
If the splice is applied to a right-hand pattern element, that element is spliced only when it appears as part of the production for that rule.
Suppose we remove the cut from the @racket[factor] rule name and instead splice the second appearance of @racket[factor] in the pattern for the @racket[term] rule:
@verbatim|{
#lang brag
expr : term (/'+' term)*
term : factor (/'*' @factor)*
factor : ("0" | "1" | "2" | "3"
| "4" | "5" | "6" | "7"
| "8" | "9")+
}|
The @racket[factor] elements matching the first position of the @racket[term] pattern remain as they were, but the @racket[factor] element matching the second position is spliced into the surrounding node:
As a convenience, when a grammar element is spliced, or a rule name is cut, @racket[brag] preserves the rule name by adding it as a syntax property to the residual elements, using the rule name as a key, and the original syntax object representing the rule name as the value.
@subsection{Syntax errors}
Besides the basic syntax errors that can occur with a malformed grammar, there
are a few other classes of situations that @litchar{#lang brag} will consider
as syntax errors.
@tt{brag} will raise a syntax error if the grammar:
@itemize[
@item{doesn't have any rules.}
@item{has a rule with the same left hand side as any other rule.}
@item{refers to rules that have not been defined. e.g. the
following program:
@nested[#:style 'code-inset
@verbatim|{
#lang brag
foo: [bar]
}|
]
should raise an error because @tt{bar} has not been defined, even though
@tt{foo} refers to it in an @tech{optional pattern}.}
@item{uses the token name @racket[EOF]; the end-of-file token type is reserved
for internal use by @tt{brag}.}
@item{contains a rule that has no finite derivation. e.g. the following
program:
@nested[#:style 'code-inset
@verbatim|{
#lang brag
infinite-a: "a" infinite-a
}|
]
should raise an error because no finite sequence of tokens will satisfy
@tt{infinite-a}.}
]
Otherwise, @tt{brag} should be fairly tolerant and permit even ambiguous
The @racketmodname[brag/support] module provides functions to interact with
@tt{brag} programs. The most useful is the @racket[token] function, which
produces tokens to be parsed.
In addition to the exports shown below, the @racketmodname[brag/support] module also provides everything from @racketmodname[brag/support], and everything from @racketmodname[br-parser-tools/lex].
@defproc[(token [type (or/c string? symbol?)]
[val any/c #f]
[#:line line (or/c positive-integer? #f) #f]
[#:column column (or/c natural-number? #f) #f]
[#:position position (or/c positive-integer? #f) #f]
[#:span span (or/c natural-number? #f) #f]
[#:skip? skip? boolean? #f]
)
token-struct?]{
Creates instances of @racket[token-struct]s.
The syntax objects produced by a parse will inject the value @racket[val] in
place of the token name in the grammar.
If @racket[#:skip?] is true, then the parser will skip over it during a
parse.}
@defstruct[token-struct ([type symbol?]
[val any/c]
[position (or/c positive-integer? #f)]
[line (or/c natural-number? #f)]
[column (or/c positive-integer? #f)]
[span (or/c natural-number? #f)]
[skip? boolean?])
#:transparent]{
The token structure type.
Rather than directly using the @racket[token-struct] constructor, please use
the helper function @racket[token] to construct instances.
Repeatedly apply @racket[tokenizer-maker] to @racket[source], gathering the resulting tokens into a list. @racket[source] can be a string or an input port. Useful for testing or debugging a tokenizer.
}
@defproc[(apply-lexer [lexer procedure?]
[source (or/c string?
input-port?)])
list?]{
Repeatedly apply @racket[lexer] to @racket[source], gathering the resulting tokens into a list. @racket[source] can be a string or an input port. Useful for testing or debugging a lexer.
}
@defproc[(trim-ends [left-str string?]
[str string?]
[right-str string?])
string?]{
Remove @racket[left-str] from the left side of @racket[str], and @racket[right-str] from its right side. Intended as a helper function for @racket[from/to].
}
@defform[(:* re ...)]{
Repetition of @racket[re] sequence 0 or more times.}
@defform[(:+ re ...)]{
Repetition of @racket[re] sequence 1 or more times.}
@defform[(:? re ...)]{
Zero or one occurrence of @racket[re] sequence.}
@defform[(:= n re ...)]{
Exactly @racket[n] occurrences of @racket[re] sequence, where
@racket[n] must be a literal exact, non-negative number.}
@defform[(:>= n re ...)]{
At least @racket[n] occurrences of @racket[re] sequence, where
@racket[n] must be a literal exact, non-negative number.}
@defform[(:** n m re ...)]{
Between @racket[n] and @racket[m] (inclusive) occurrences of
@racket[re] sequence, where @racket[n] must be a literal exact,
non-negative number, and @racket[m] must be literally either
@racket[#f], @racket[+inf.0], or an exact, non-negative number; a
@racket[#f] value for @racket[m] is the same as @racket[+inf.0].}
@defform[(:or re ...)]{
Same as @racket[(union re ...)].}
@deftogether[(
@defform[(:: re ...)]
@defform[(:seq re ...)]
)]{
Both forms concatenate the @racket[re]s.}
@defform[(:& re ...)]{
Intersects the @racket[re]s.}
@defform[(:- re ...)]{
The set difference of the @racket[re]s.}
@defform[(:~ re ...)]{
Character-set complement, which each @racket[re] must match exactly
one character.}
@defform[(:/ char-or-string ...)]{
Character ranges, matching characters between successive pairs of
characters.}
@defform[(from/to open close)]{
A string that is bounded by @racket[open] and @racket[close]. Matching is non-greedy (meaning, it stops at the first occurence of @racket[close]). The resulting lexeme includes @racket[open] and @racket[close]. To remove them, see @racket[trim-ends].}
@defform[(from/stop-before open close)]{
Like @racket[from/to], a string that is bounded by @racket[open] and @racket[close], except that @racket[close] is not included in the resulting lexeme. Matching is non-greedy (meaning, it stops at the first occurence of @racket[close]).}