Form} (BNF). So let's try to notate the structure of nested word lists in BNF.
Second, how might we describe this formally, meaning in a way that a computer could understand? A common notation to describe the structure of these things is @link["http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form"]{Backus-Naur Form} (BNF). So let's try to notate the structure of nested word lists in BNF.
@nested[#:style 'code-inset]{
@nested[#:style 'code-inset]{
@verbatim{
@verbatim{
@ -60,12 +56,7 @@ nested-word-list: WORD
| LEFT-PAREN nested-word-list* RIGHT-PAREN
| LEFT-PAREN nested-word-list* RIGHT-PAREN
}}
}}
What we intend by this notation is this: @racket[nested-word-list] is either an
What we intend by this notation is this: @racket[nested-word-list] is either a @racket[WORD], or a parenthesized list of @racket[nested-word-list]s. We use the character @litchar{*} to represent zero or more repetitions of the previous thing. We treat the uppercased @racket[LEFT-PAREN], @racket[RIGHT-PAREN], and @racket[WORD] as placeholders for @emph{tokens} (a @deftech{token} being the smallest meaningful item in the parsed string):
atomic @racket[WORD], or a parenthesized list of any number of
@racket[nested-word-list]s. We use the character @litchar{*} to represent zero
or more repetitions of the previous thing, and we treat the uppercased
@racket[LEFT-PAREN], @racket[RIGHT-PAREN], and @racket[WORD] as placeholders
for atomic @emph{tokens}.
Here are a few examples of tokens:
Here are a few examples of tokens:
@interaction[#:eval my-eval
@interaction[#:eval my-eval
@ -74,15 +65,11 @@ Here are a few examples of tokens:
(token 'WORD "crunchy" #:span 7)
(token 'WORD "crunchy" #:span 7)
(token 'RIGHT-PAREN)]
(token 'RIGHT-PAREN)]
This BNF description is also known as a @deftech{grammar}. Just as it does in a natural language like English or French, a grammar describes something in terms of what elements can fit where.
Have we made progress? At this point, we only have a BNF description in hand,
Have we made progress? We have a valid grammar. But we're still missing a @emph{parser}: a function that can use that description to make structures out of a sequence of tokens.
but we're still missing a @emph{parser}, something to take that description and
use it to make structures out of a sequence of tokens.
Meanwhile, it's clear that we don't yet have a valid program because there's no @litchar{#lang} line. Let's add one: put @litchar{#lang brag} at the top of the grammar, and save it as a file called @filepath{nested-word-list.rkt}.
It's clear that we don't yet have a program because there's no @litchar{#lang}
line. We should add one. Put @litchar{#lang brag} at the top of the BNF
description, and save it as a file called @filepath{nested-word-list.rkt}.
doctrine; the structure of the grammar informs the structure of the
guidelines. The structure of the grammar informs the structure of the
Racket syntax objects it generates.}
Racket syntax objects it generates.}
@item{The language uses a few conventions to simplify the expression of
@item{The language uses a few conventions to simplify the expression of
grammars. The first rule in the grammar is automatically assumed to be the
grammars. The first rule in the grammar is assumed to be the
starting production. Identifiers in uppercase are assumed to represent
starting production. Identifiers in @tt{UPPERCASE} are treated as
terminal tokens, and are otherwise the names of nonterminals.}
terminal tokens. All other identifiers are treated as nonterminals.}
@item{Tokenizers can be developed completely independently of parsers.
@item{Tokenizers can be developed independently of parsers.
@tt{brag} takes a liberal view on tokens: they can be strings,
@tt{brag} takes a liberal view on tokens: they can be strings,
symbols, or instances constructed with @racket[token]. Furthermore,
symbols, or instances constructed with @racket[token]. Tokens can optionally provide source location, in which case a syntax object generated by the parser will too.}
tokens can optionally provide location: if tokens provide location, the
generated syntax objects will as well.}
@item{The underlying parser should be able to handle ambiguous grammars.}
@item{The parser can usually handle ambiguous grammars.}
@item{It should integrate with the rest of the Racket
@link["http://stackoverflow.com/questions/12345647/rewrite-this-script-by-designing-an-interpreter-in-racket"]{derived from a question} on Stack Overflow.}
of a question on Stack Overflow}.} To motivate @tt{brag}'s design, let's look
at the following toy problem: we'd like to define a language for
To understand @tt{brag}'s design, let's look
drawing simple ASCII diagrams. We'd like to be able to write something like this:
at a toy problem. We'd like to define a language for
drawing simple ASCII diagrams. So if we write something like this:
@nested[#:style 'inset]{
@nested[#:style 'inset]{
@verbatim|{
@verbatim|{
@ -197,7 +182,7 @@ drawing simple ASCII diagrams. We'd like to be able write something like this:
3 9 X;
3 9 X;
}|}
}|}
whose interpretation should generate the following picture:
It should generate the following picture:
@nested[#:style 'inset]{
@nested[#:style 'inset]{
@verbatim|{
@verbatim|{
@ -218,10 +203,11 @@ XXXXXXXXX
@subsection{Syntax and semantics}
@subsection{Syntax and semantics}
We're being very fast-and-loose with what we mean by the program above, so
let's try to nail down some meanings. Each line of the program has a semicolon
We're being somewhat casual with what we mean by the program above, so
at the end, and describes the output of several @emph{rows} of the line
let's try to nail down some meanings.
drawing. Let's look at two of the lines in the example:
Each line of the program has a semicolon at the end, and describes the output of several @emph{rows} of the line drawing. Let's look at two of the lines in the example:
@itemize[
@itemize[
@item{@litchar{3 9 X;}: ``Repeat the following 3 times: print @racket["X"] nine times, followed by
@item{@litchar{3 9 X;}: ``Repeat the following 3 times: print @racket["X"] nine times, followed by
@ -232,21 +218,14 @@ followed by @racket["X"] three times, followed by @racket[" "] three times, foll
]
]
Then each line consists of a @emph{repeat} number, followed by pairs of
Then each line consists of a @emph{repeat} number, followed by pairs of
(number, character) @emph{chunks}. We will
(number, character) @emph{chunks}. We'll assume here that the intent of the lowercased character @litchar{b} is to represent the printing of a 1-character whitespace @racket[" "], and for other uppercase letters to represent the printing of themselves.
assume here that the intent of the lowercased character @litchar{b} is to
represent the printing of a 1-character whitespace @racket[" "], and for other
uppercase letters to represent the printing of themselves.
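To make that reading concrete, here is a minimal sketch of the intended semantics in plain Racket, independent of any parser. The helper names @racket[chunk->string] and @racket[line->rows] are hypothetical and exist only for illustration, as is the choice to represent each chunk as a (count . letter) pair.

```racket
#lang racket/base

;; Hypothetical helper (not part of brag): render one chunk.
;; The letter "b" stands for a single blank; any other letter prints itself.
(define (chunk->string count letter)
  (define printed (if (string=? letter "b") " " letter))
  (apply string-append (for/list ([i (in-range count)]) printed)))

;; Hypothetical helper: render one program line as a list of identical rows.
;; `chunks` is a list of (count . letter) pairs, an assumed representation.
(define (line->rows repeat chunks)
  (define row
    (apply string-append
           (for/list ([c (in-list chunks)])
             (chunk->string (car c) (cdr c)))))
  (for/list ([i (in-range repeat)]) row))

;; The line "3 9 X;" from the example program.
(for-each displayln (line->rows 3 '((9 . "X"))))

;; The line "6 3 b 3 X 3 b;" from the example program.
(for-each displayln (line->rows 6 '((3 . "b") (3 . "X") (3 . "b"))))
```

Under these assumptions, the first call prints three rows of nine @litchar{X}s, and the second prints six rows in which three @litchar{X}s sit between three blanks on either side.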
Once we have a better idea of the pieces of each line, we have a better chance
to capture that meaning in a formal notation. Once we have each instruction in
a structured format, we should be able to interpret it with a straightforward
case analysis.
Here is a first pass at expressing the structure of these line-drawing
By understanding the pieces of each line, we can more easily capture that meaning in a grammar. Once we have each instruction of our ASCII DSL in a structured format, we should be able to parse it.
programs.
Here's a first pass at expressing the structure of these line-drawing programs.
@subsection{Parsing the concrete syntax}
@subsection{Parsing the concrete syntax}
@filebox["simple-line-drawing.rkt"]{
@filebox["simple-line-drawing.rkt"]{
@verbatim|{
@verbatim|{
#lang brag
#lang brag
@ -258,7 +237,7 @@ chunk: INTEGER STRING
}
}
@margin-note{@secref{brag-syntax} describes @tt{brag}'s syntax in more detail.}
@margin-note{@secref{brag-syntax} describes @tt{brag}'s syntax in more detail.}
We write a @tt{brag} program as an extended BNF grammar, where patterns can be:
We write a @tt{brag} program as a BNF grammar, where patterns can be:
@itemize[
@itemize[
@item{the names of other rules (e.g. @racket[chunk])}
@item{the names of other rules (e.g. @racket[chunk])}
@item{literal and symbolic token names (e.g. @racket[";"], @racket[INTEGER])}
@item{literal and symbolic token names (e.g. @racket[";"], @racket[INTEGER])}
@ -282,17 +261,11 @@ Let's exercise this function:
(syntax->datum stx)
(syntax->datum stx)
]
]
Tokens can either be: plain strings, symbols, or instances produced by the
A @emph{token} is the smallest meaningful element of a source program. Tokens can be strings, symbols, or instances of the @racket[token] data structure. (Plus a few other special cases, which we'll discuss later.) Usually, a token holds a single character from the source program. But sometimes it makes sense to package a sequence of characters into a single token, if the sequence has an indivisible meaning.
@racket[token] function. (Plus a few more special cases, one of which we'll describe in a
moment.)
Preferably, we want to attach each token with auxiliary source location
If possible, we also want to attach source location information to each token. Why? Because this information will be incorporated into the syntax objects produced by @racket[parse].
information. The more source location we can provide, the better, as the
syntax objects produced by @racket[parse] will incorporate them.
Let's write a helper function, a @emph{lexer}, to help us construct tokens more
A parser often works in conjunction with a helper function called a @emph{lexer} that converts the raw code of the source program into tokens. The @racketmodname[parser-tools/lex] library can help us write a position-sensitive
easily. The Racket standard library comes with a module called
@racketmodname[parser-tools/lex] which can help us write a position-sensitive
tokenizer:
tokenizer:
@interaction[#:eval my-eval
@interaction[#:eval my-eval
@ -328,24 +301,19 @@ tokenizer:
]
]
There are a few things to note from this lexer example:
Note also from this lexer example:
@itemize[
@itemize[
@item{The @racket[parse] function can consume either sequences of tokens, or a
@item{@racket[parse] accepts as input either a sequence of tokens, or a
function that produces tokens. Both of these are considered sources of
function that produces tokens (which @racket[parse] will call repeatedly to get the next token).}
tokens.}
@item{As a special case for acceptable tokens, a token can also be an instance
@item{As an alternative to the basic @racket[token] structure, a token can also be an instance of the @racket[position-token] structure (also found in @racketmodname[parser-tools/lex]). In that case, the token will try to derive its position from that of the position-token.}
of the @racket[position-token] structure of @racketmodname[parser-tools/lex],
in which case the token will try to derive its position from that of the
position-token.}
@item{The @racket[parse] function will stop reading from a token source if any
@item{@racket[parse] will stop if it gets @racket[void] (or @racket['eof]) as a token.}
token is @racket[void].}
@item{The @racket[parse] function will skip over any token with the
@item{@racket[parse] will skip any token that has its
@racket[#:skip?] attribute. Elements such as whitespace and comments will
@racket[#:skip?] attribute set to @racket[#t]. For instance, tokens representing comments often use @racket[#:skip?].}
often have @racket[#:skip?] set to @racket[#t].}
]
]
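The first and third notes describe a small protocol: a token source may be a function that is called repeatedly and that signals the end of input by returning a void value. A minimal sketch of such a function, in plain Racket, might look like the following. The name @racket[make-token-source] is hypothetical, and the tokens are ordinary strings chosen only for illustration.

```racket
#lang racket/base

;; Hypothetical helper (not part of brag): wrap a list of tokens in a
;; zero-argument function. Each call returns the next token; once the
;; list is exhausted, every call returns a void value, which is the
;; end-of-input signal described in the notes above.
(define (make-token-source tokens)
  (define remaining tokens)
  (lambda ()
    (cond
      [(null? remaining) (void)]
      [else
       (define next (car remaining))
       (set! remaining (cdr remaining))
       next])))

;; Example: a source that yields three string tokens and then void.
(define next-token (make-token-source '("3" "9" "X")))
```

A function-based source like this is useful when tokens are produced lazily, for example by a lexer reading from a port, so the whole token sequence never has to be built in advance.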
@ -353,16 +321,16 @@ often have @racket[#:skip?] set to @racket[#t].}
@subsection{From parsing to interpretation}
@subsection{From parsing to interpretation}
We now have a parser for programs written in this simple-line-drawing language.
We now have a parser for programs written in this simple-line-drawing language.
Our parser will give us back syntax objects:
Our parser will return syntax objects:
@interaction[#:eval my-eval
@interaction[#:eval my-eval
(define parsed-program
(define parsed-program
(parse (tokenize (open-input-string "3 9 X; 6 3 b 3 X 3 b; 3 9 X;"))))
(parse (tokenize (open-input-string "3 9 X; 6 3 b 3 X 3 b; 3 9 X;"))))
(syntax->datum parsed-program)
(syntax->datum parsed-program)
]
]
Moreover, we know that these syntax objects have a regular, predictable
Better still, these syntax objects will have a predictable
structure. Their structure follows the grammar, so we know we'll be looking at
structure that follows the grammar:
values of the form:
@racketblock[
@racketblock[
(drawing (rows (repeat <number>)
(drawing (rows (repeat <number>)
@ -374,15 +342,14 @@ where @racket[drawing], @racket[rows], @racket[repeat], and @racket[chunk]
should be treated literally, and everything else will be numbers or strings.
should be treated literally, and everything else will be numbers or strings.
Still, these syntax object values are just inert structures. How do we
Still, these syntax-object values are just inert structures. How do we
interpret them, and make them @emph{print}? We did claim at the beginning of
interpret them, and make them @emph{print}? We claimed at the beginning of
this section that these syntax objects should be fairly easy to case-analyze
this section that these syntax objects should be easy to interpret. So let's do it.
and interpret, so let's do it.
@margin-note{This is a very quick-and-dirty treatment of @racket[syntax-parse].
@margin-note{This is a very quick-and-dirty treatment of @racket[syntax-parse].
See the @racketmodname[syntax/parse] documentation for a gentler guide to its
See the @racketmodname[syntax/parse] documentation for a gentler guide to its
features.} Racket provides a special form called @racket[syntax-parse] in the
features.} Racket provides a special form called @racket[syntax-parse] in the
@racketmodname[syntax/parse] library. @racket[syntax-parse] lets us do a
@racketmodname[syntax/parse] library. @racket[syntax-parse] lets us do a
structural case-analysis on syntax objects: we provide it a set of patterns to
structural case-analysis on syntax objects: we provide it a set of patterns to
parse and actions to perform when those patterns match.
parse and actions to perform when those patterns match.
@ -405,7 +372,7 @@ says @racket[#t] if it's the literal @racket[yes], and @racket[#f] otherwise:
]
]
Here, we use @racket[~literal] to let @racket[syntax-parse] know that
Here, we use @racket[~literal] to let @racket[syntax-parse] know that
@racket[yes] should show up literally in the syntax object. The patterns can
@racket[yes] should show up literally in the syntax object. The patterns can
also have some structure to them, such as:
also have some structure to them, such as:
@racketblock[({~literal drawing} rows-stxs ...)]
@racketblock[({~literal drawing} rows-stxs ...)]
which matches on syntax objects that begin, literally, with @racket[drawing],
which matches on syntax objects that begin, literally, with @racket[drawing],