main
Matthew Butterick 5 years ago
parent d6aaf103aa
commit 384bbae48c

@ -0,0 +1,3 @@
#lang info
(define version "0.0")
(define scribblings '(("scribblings/xenomorph.scrbl" ())))

@ -1,5 +1,5 @@
#lang racket/base
(require "base.rkt" racket/class)
(require "base.rkt" racket/class racket/contract)
(provide (all-defined-out))
#|
@ -21,6 +21,9 @@ https://github.com/mbutterick/restructure/blob/master/src/Number.coffee
(define (exact-if-possible x) (if (integer? x) (inexact->exact x) x))
(define (endian-value? x)
(and (symbol? x) (memq x '(be le))))
(define system-endian (if (system-big-endian?) 'be 'le))
(define x:number%
@ -73,15 +76,28 @@ https://github.com/mbutterick/restructure/blob/master/src/Number.coffee
([i (in-range @size)])
(values (cons (bitwise-and val #xff) bs) (arithmetic-shift val -8))))))
(define (x:int [size-arg #f]
(define/contract (x:int [size-arg #f]
#:size [size-kwarg 2]
#:signed [signed #true]
#:endian [endian system-endian]
#:pre-encode [pre-proc #f]
#:post-decode [post-proc #f]
#:base-class [base-class x:int%])
(()
((or/c exact-positive-integer? #false)
#:size exact-positive-integer?
#:signed boolean?
#:endian endian-value?
#:pre-encode (or/c (any/c . -> . any/c) #false)
#:post-decode (or/c (any/c . -> . any/c) #false)
#:base-class (λ (c) (subclass? c x:int%)))
. ->* .
x:int?)
(define size (or size-arg size-kwarg))
(unless (exact-positive-integer? size)
(raise-argument-error 'x:int "exact positive integer" size))
(new (generate-subclass base-class pre-proc post-proc)
[size (or size-arg size-kwarg)]
[size size]
[signed signed]
[endian endian]))

@ -1,6 +1,6 @@
#lang scribble/manual
@(require scribble/eval (for-label racket xenomorph))
@(require scribble/eval (for-label racket/base racket/class racket/file xenomorph))
@(define my-eval (make-base-eval))
@(my-eval `(require xenomorph))
@ -8,27 +8,29 @@
@title{Xenomorph: binary encoding & decoding}
@author[(author+email "Matthew Butterick" "mb@mbtype.com") "Devon Govett"]
@defmodule[xenomorph]
@author[(author+email "Matthew Butterick" "mb@mbtype.com")]
@margin-note{This package is in development. I make no commitment to maintaining the public interface documented below.}
Hands up: who likes parsing and writing binary formats?
@defmodule[xenomorph]
Hands up: who likes working with binary formats?
OK, just a few of you, in the back. You're free to go.
Everyone else: Xenomorph eases the pain of working with binary formats. Instead of fiddling with counting bytes:
As for everyone else: Xenomorph eases the pain of working with binary formats. Instead of laboriously counting bytes —
@itemlist[#:style 'ordered
@item{You define an @deftech{encoding} describing the binary format using smaller ingredients — e.g., integers, strings, arrays, pointers, and sub-encodings.}
@item{You describe a binary format declaratively by using smaller ingredients — e.g., integers, strings, lists, pointers, structs, and perhaps other nested encodings. This is known as a @deftech{xenomorphic object}.}
@item{This encoding can then be used as a binary compiler, converting Racket values to binary and writing them out to a file.}
@item{This xenomorphic object can then be used as a binary encoder, allowing you to convert Racket values to binary and write them out to a file.}
@item{But wait, there's more: this encoding can @emph{also} be used as a binary parser, reading bytes and parsing them into Racket values. So one encoding definition can be used for both input and output.}
@item{But wait, there's more: once defined, this xenomorphic object can @emph{also} be used as a binary decoder, reading bytes and parsing them into Racket values.}
]
Derived principally from Devon Govett's @link["https://github.com/devongovett/restructure"]{@tt{restructure}} library for Node. Thanks for doing the heavy lifting, dude.
So one binary-format definition can be used for both input and output. Meanwhile, Xenomorph handles all the dull housekeeping of counting bytes (because somebody has to).
This package is derived principally from Devon Govett's @link["https://github.com/devongovett/restructure"]{@tt{restructure}} library for Node.js. Thanks for doing the heavy lifting, dude.
@section{Installation}
@ -46,7 +48,7 @@ Invoke the library in a source file by importing it in the usual way:
@verbatim{(require xenomorph)}
@;{
@section{Quick tutorial}
@examples[#:eval my-eval
@ -62,18 +64,20 @@ Invoke the library in a source file by importing it in the usual way:
]
}
@section{The big picture}
@subsection{Bytes and byte strings}
Suppose we have a file on disk. What's in the file? Without knowing anything else, we can at least say the file contains a sequence of @deftech{bytes}. A @deftech{byte} is the smallest unit of data storage. It's not, however, the smallest unit of information storage — that would be a @deftech{bit}. But when we read (or write) from disk (or other source, like memory), we work with bytes.
Suppose we have a file on disk. What's in the file? Without knowing anything else, we can at least say the file contains a sequence of @deftech{bytes}. A @tech{byte} is the smallest unit of data storage. It's not, however, the smallest unit of information storage — that would be a @deftech{bit}. But when we read (or write) from disk (or other source, like memory), we work with bytes.
A byte holds eight bits, so it can take on values between 0 and 255, inclusive. In Racket, a sequence of bytes is also known as a @deftech{byte string}. It prints as a series of values between quotation marks, prefixed with @litchar{#}:
@racketblock[#"ABC"]
Caution: though this looks similar to the ordinary string @racket["ABC"], we're better off thinking of it as a sequence of integers that are sometimes displayed as characters for convenience. For instance, the byte string above represents three bytes valued 65, 66, and 67. This byte string could also be written in hexadecimal like so:
Caution: though this looks similar to the ordinary string @racket["ABC"], we're better off thinking of it as a block of integers that are sometimes displayed as characters for convenience. For instance, the byte string above represents three bytes valued 65, 66, and 67. This byte string could also be written in hexadecimal like so:
@(racketvalfont "#\"\\x41\\x42\\x43\"")
@ -85,12 +89,9 @@ Both of these mean the same thing. (If you like, confirm this by trying them on
We can also make an equivalent byte string with @racket[bytes]. As above, Racket doesn't care how we notate the values, as long as they're between 0 and 255:
@bold{TODO}: escape the chars below
@examples[#:eval my-eval
(bytes 65 66 67)
(bytes #x41 #x42 #x43)
(bytes #o101 #o102 #o103)
(bytes (+ 31 34) (* 3 22) (- 100 33))
(apply bytes (map char->integer '(#\A #\B #\C)))
]
@ -100,20 +101,25 @@ Byte values between 32 and 127 are printed as characters. Other values are print
(bytes 65 66 67 154 206 255)
]
If you think this printing convention is a little weird, I agree. But that's how Racket does it. If we prefer to deal with lists of integers, we can always use @racket[bytes->list] and @racket[list->bytes]:
If you think this printing convention is a little weird, I agree. But that's how Racket does it.
If we prefer to deal with lists of integers, we can always use @racket[bytes->list] and @racket[list->bytes]:
@examples[#:eval my-eval
(bytes->list #"ABC\232\316\377")
(list->bytes '(65 66 67 154 206 255))
]
The important thing is that when we see the @litchar{#"} prefix, we know we're looking at a byte string, not an ordinary string.
The key point: the @litchar{#"} prefix tells us we're looking at a byte string, not an ordinary string.
@subsection{Encodings}
@subsection{Binary formats}
Back to files. Typically, files on disk are classified as being either @deftech{binary} or @deftech{text}. (A distinction observed by Racket functions such as @racket[write-to-file].) When we speak of binary vs. text, we're saying something about the internal structure of the byte sequence — what values those bytes represent. This internal structure is also called an @deftech{encoding}. An encoding is a way of representing a sequence of arbitrary values as a sequence of bytes.
Back to files. Files are classified as being either @deftech{binary} or @deftech{text}. (A distinction observed by Racket functions such as @racket[write-to-file].) When we speak of binary vs. text, we're saying something about the internal structure of the byte sequence — what values those bytes represent. We'll call this internal structure the @deftech{binary format} of the file.
@margin-note{This internal structure is also called an @tech{encoding}. Here, however, I avoid using that term as a synonym for @tech{binary format}, because I prefer to reserve it for when we talk about encoding and decoding as operations on data.}
@;{
@subsubsection{Text encodings}
Text files are a just a particular subset of binary files that use a @deftech{text encoding} — that is, a binary encoding that stores human-readable characters.
@ -137,6 +143,7 @@ Now consider the string @racket["ABÇ战"], which has four characters, but the s
Moreover, for further simplicity, text files typically rely on a small set of pre-defined encodings, like ASCII or UTF-8 or Latin-1, so that those who write programs that manipulate text only have to support a smallish set of encodings.
@subsubsection{Binary encodings}
@ -158,57 +165,233 @@ For those familiar with programming-language lingo, an encoding somewhat resembl
@margin-note{Can a grammar work as a binary encoding? In limited cases, but not enough to be practical. Most grammars have to assume the target program is context free, meaning that the grammar rules apply the same way everywhere. By contrast, binary files are nonrecursive and contextual.}
}
@section{Main interface}
@section{Core functions}
@defproc[
(xenomorphic?
[x any/c])
boolean?]{
Predicate for whether @racket[x] is an object of type @racket[x:base%].
}
@defproc[
(decode
[template (is-a?/c xenomorph-base%)]
[byte-source (or/c bytes? input-port?) (current-input-port)])
[xenomorphic-obj xenomorphic?]
[byte-source (or/c bytes? input-port?) (current-input-port)]
[#:parent parent (or/c xenomorphic? #false) #false]
[arg any/c] ...)
any/c]{
TK
Read bytes from @racket[byte-source] and convert them to a Racket value using @racket[xenomorphic-obj] as the decoder.
}
@defproc[
(encode
[template (is-a?/c xenomorph-base%)]
[v any/c]
[byte-dest (or/c output-port? #f) (current-output-port)])
[xenomorphic-obj xenomorphic?]
[val any/c]
[byte-dest (or/c output-port? #false) (current-output-port)]
[#:parent parent (or/c xenomorphic? #false) #false]
[arg any/c] ...
)
(or/c void? bytes?)]{
TK
Convert @racket[val] to bytes using @racket[xenomorphic-obj] as the encoder.
If @racket[byte-dest] is an @racket[output-port?], the bytes are written there and the return value is @racket[(void)]. If @racket[byte-dest] is @racket[#false], the encoded byte string is the return value.
}
@defproc[
(size
[template (is-a?/c xenomorph-base%)]
[v any/c])
[xenomorphic-obj xenomorphic?]
[val any/c #false]
[#:parent parent (or/c xenomorphic? #false) #false]
[arg any/c] ...)
exact-nonnegative-integer?]{
TK
The length of the @tech{byte string} that @racket[val] would produce if it were encoded using @racket[encode].
}
@section{Binary ingredients}
@section{Core xenomorphic objects}
These basic xenomorphic objects can be used on their own, or combined to make bigger xenomorphic objects.
Note on naming: the main xenomorphix objects have an @litchar{x:} prefix to distinguish them from (and prevent name collisions with) the ordinary Racket thing (for instance, @racket[x:list] vs. @racket[list]). Other xenomorphic objects (like @racket[uint8]) don't have this prefix, because it seems unnecessary and therefore laborious.
@defclass[x:base% object% ()]{
When making your own xenomorphic objects, usually you'll want to stick together existing core objects, or inherit from one of those classes. Inheriting from @racket[x:base%] is also allowed, but you have to do all the heavy lifting.
@defmethod[
#:mode pubment
(x:decode
[input-port input-port?]
[parent (or/c xenomorphic? #false)]
[args any/c] ...)
any/c]{
Read bytes from @racket[input-port] and convert them into a Racket value. Called by @racket[decode].
}
@defmethod[
(post-decode
[val any/c])
any/c]{
Hook for post-processing on @racket[val] after it's returned by @racket[x:decode] but before it's returned by @racket[decode].
}
@defmethod[
#:mode pubment
(x:encode
[val any/c]
[output-port output-port?]
[parent (or/c xenomorphic? #false)]
[args any/c] ...)
bytes?]{
Convert a value into a @tech{byte string} which is written to @racket[output-port]. Called by @racket[encode].
}
@defmethod[
(pre-encode
[val any/c])
any/c]{
Hook for pre-processing on @racket[val] after it's passed to @racket[encode] but before it's passed to @racket[x:encode].
}
@defmethod[
#:mode pubment
(x:size
[val any/c]
[parent (or/c xenomorphic? #false)]
[args any/c] ...)
exact-nonnegative-integer?]{
The length of the @tech{byte string} that @racket[val] would produce if it were encoded using @racket[x:encode]. Called by @racket[size].
}
}
@subsection{Numbers}
@defmodule[xenomorph/number]
@defproc[
(endian-value?
[val any/c])
boolean?]{
Whether @racket[val] is either @racket['be] (representing big endian) or @racket['le] (representing little endian).
}
@defthing[system-endian endian-value?]{
The endian value of the current system. Big endian is represented as @racket['be] and little endian as @racket['le]. This can be used as an argument for the @racket[x:number%] constructor.
Use this value carefully, however, as most binary formats are defined using one endian convention or the other (so that they can be exchanged among systems regardless of endianness).
}
@defclass[x:number% x:base% ()]{
@defconstructor[
([size exact-positive-integer?]
[endian endian-value?])]{
Create class instance that represents a binary number format @racket[size] bytes long with @racket[endian] byte ordering. The endian arugment can be @racket[system-endian].
}
}
@subsubsection{Integers}
@defclass[x:int% x:number% ()]{
Base class for integer objects. Use @racket[x:int] to conveniently instantiate new integer objects.
}
@defproc[
(x:int?
[x any/c])
boolean?]{
Predicate for whether @racket[x] is an object of type @racket[x:int%].
}
@defproc[
(x:int
[size-arg (or/c exact-positive-integer? #false) #false]
[#:size size-kw exact-positive-integer? 2]
[#:signed signed boolean? #true]
[#:endian endian endian-value? system-endian]
[#:pre-encode pre-encode-proc (or/c (any/c . -> . any/c) #false) #false]
[#:post-decode post-decode-proc (or/c (any/c . -> . any/c) #false) #false]
[#:base-class base-class (λ (c) (subclass? c x:int%)) x:int%]
)
x:int?]{
Generate an instance of @racket[x:int%] (or a subclass of @racket[x:int%]) with certain optional attributes.
@racket[size-arg] or @racket[size-kw] (whichever is provided, though @racket[size-kw] takes precedence) controls the encoded size.
@racket[signed] controls whether the integer is signed or unsigned.
@racket[endian] controls the byte-ordering convention.
@racket[pre-encode-proc] and @racket[post-decode-proc] control the pre-encoding and post-decodeing procedures, respectively.
@racket[base-class] controls the class used for instantiation of the new object.
}
@deftogether[
(@defthing[int8 x:int?]
@defthing[int16 x:int?]
@defthing[int24 x:int?]
@defthing[int32 x:int?]
@defthing[uint8 x:int?]
@defthing[uint16 x:int?]
@defthing[uint24 x:int?]
@defthing[uint32 x:int?])
]{
The common integer types, using @racket[system-endian] endianness. The @racket[u] prefix indicates unsigned. The numerical suffix indicates bit length.
Use these carefully, however, as most binary formats are defined using one endian convention or the other (so that they can be exchanged among systems regardless of endianness).
}
@deftogether[
(@defthing[int8be x:int?]
@defthing[int16be x:int?]
@defthing[int24be x:int?]
@defthing[int32be x:int?]
@defthing[uint8be x:int?]
@defthing[uint16be x:int?]
@defthing[uint24be x:int?]
@defthing[uint32be x:int?])
]{
Big-endian versions of the common integer types. The @racket[u] prefix indicates unsigned. The numerical suffix indicates bit length.
}
@deftogether[
(@defthing[int8le x:int?]
@defthing[int16le x:int?]
@defthing[int24le x:int?]
@defthing[int32le x:int?]
@defthing[uint8le x:int?]
@defthing[uint16le x:int?]
@defthing[uint24le x:int?]
@defthing[uint32le x:int?])
]{
Little-endian versions of the common integer types. The @racket[u] prefix indicates unsigned. The numerical suffix indicates bit length.
}
@subsubsection{Floats}
@subsubsection{Fixed-points}
@subsection{Strings}
@defmodule[xenomorph/string]
@subsection{Arrays}
@defmodule[xenomorph/array]
@subsection{Lists}
@subsection{Lazy arrays}
@defmodule[xenomorph/list]
@defmodule[xenomorph/lazy-array]
@subsection{Streams}
@defmodule[xenomorph/stream]
@subsection{Structs}

Loading…
Cancel
Save