# Performance
Alan Perlis famously quipped “Lisp programmers know the value of
everything and the cost of nothing.” A Racket programmer knows, for
example, that a `lambda` anywhere in a program produces a value that is
closed over its lexical environment—but how much does allocating that
value cost? While most programmers have a reasonable grasp of the cost
of various operations and data structures at the machine level, the gap
between the Racket language model and the underlying computing machinery
can be quite large.
In this chapter, we narrow the gap by explaining details of the Racket
compiler and run-time system and how they affect the run-time and memory
performance of Racket code.
## 1. Performance in DrRacket
By default, DrRacket instruments programs for debugging, and debugging
instrumentation (provided by the [missing] library) can significantly
degrade performance for some programs. Even when debugging is disabled
through the Choose Language... dialog's Show Details panel, the Preserve
stacktrace checkbox is checked by default, which also affects
performance. Disabling debugging and stacktrace preservation provides
performance results that are more consistent with running in plain
`racket`.
Even so, DrRacket and programs developed within DrRacket use the same
Racket virtual machine, so garbage collection times \(see Memory
Management\) may be longer in DrRacket than when a program is run by
itself, and DrRacket threads may impede execution of program threads.
**For the most reliable timing results for a program, run in plain
`racket` instead of in the DrRacket development environment.**
Non-interactive mode should be used instead of the REPL to benefit from
the module system. See Modules and Performance for details.
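For example, a small benchmark can be placed in a module and run from the
command line with `racket`; the file name `sum.rkt` here is only
illustrative:
```racket
#lang racket/base

;; Save as "sum.rkt" (a hypothetical name) and run with `racket sum.rkt`
;; so that timing reflects compiled, non-instrumented code.
(define (sum-to n)
  (for/fold ([acc 0]) ([i (in-range n)])
    (+ acc i)))

(time (sum-to 10000000))
```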
## 2. The Bytecode and Just-in-Time \(JIT\) Compilers
Every definition or expression to be evaluated by Racket is compiled to
an internal bytecode format. In interactive mode, this compilation
occurs automatically and on-the-fly. Tools like `raco make` and `raco
setup` marshal compiled bytecode to a file, so that you do not have to
compile from source every time that you run a program. \(Most of the
time required to compile a file is actually in macro expansion;
generating bytecode from fully expanded code is relatively fast.\) See
\[missing\] for more information on generating bytecode files.
The bytecode compiler applies all standard optimizations, such as
constant propagation, constant folding, inlining, and dead-code
elimination. For example, in an environment where `+` has its usual
binding, the expression `(let ([x 1] [y (lambda () 4)]) (+ 1 (y)))` is
compiled the same as the constant `5`.
On some platforms, bytecode is further compiled to native code via a
_just-in-time_ or _JIT_ compiler. The JIT compiler substantially speeds
programs that execute tight loops, arithmetic on small integers, and
arithmetic on inexact real numbers. Currently, JIT compilation is
supported for x86, x86_64 (a.k.a. AMD64), ARM, and 32-bit PowerPC
processors. The JIT compiler can be disabled via the `eval-jit-enabled`
parameter or the `--no-jit`/`-j` command-line flag for `racket`.
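For example, a rough sketch of evaluating an expression with the JIT
turned off; only code compiled while the parameter is `#f` is affected:
```racket
;; Code compiled while eval-jit-enabled is #f runs from bytecode without
;; JIT-generated machine code, which can be useful for comparison.
(parameterize ([eval-jit-enabled #f])
  (eval '(let loop ([i 0] [acc 0])
           (if (= i 1000000) acc (loop (add1 i) (+ acc i))))
        (make-base-namespace)))
```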
The JIT compiler works incrementally as functions are applied, but the
JIT compiler makes only limited use of run-time information when
compiling procedures, since the code for a given module body or `lambda`
abstraction is compiled only once. The JIT's granularity of compilation
is a single procedure body, not counting the bodies of any lexically
nested procedures. The overhead for JIT compilation is normally so small
that it is difficult to detect.
## 3. Modules and Performance
The module system aids optimization by helping to ensure that
identifiers have the usual bindings. That is, the `+` provided by
`racket/base` can be recognized by the compiler and inlined, which is
especially important for JIT-compiled code. In contrast, in a
traditional interactive Scheme system, the top-level `+` binding might
be redefined, so the compiler cannot assume a fixed `+` binding \(unless
special flags or declarations are used to compensate for the lack of a
module system\).
Even in the top-level environment, importing with `require` enables some
inlining optimizations. Although a `+` definition at the top level might
shadow an imported `+`, the shadowing definition applies only to
expressions evaluated later.
Within a module, inlining and constant-propagation optimizations take
additional advantage of the fact that definitions within a module cannot
be mutated when no `set!` is visible at compile time. Such optimizations
are unavailable in the top-level environment. Although this optimization
within modules is important for performance, it hinders some forms of
interactive development and exploration. The
`compile-enforce-module-constants` parameter disables the JIT compiler's
assumptions about module definitions when interactive exploration is
more important. See [missing] for more information.
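For example, a hedged sketch of preparing a session for interactive
redefinition; the module path `"cell.rkt"` is hypothetical, and the
parameter affects only modules compiled after it is set:
```racket
;; Compile subsequently loaded modules without assuming that their
;; definitions are constant, so they can be redefined interactively.
(compile-enforce-module-constants #f)
(dynamic-require "cell.rkt" #f)  ; "cell.rkt" is a hypothetical module
```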
The compiler may inline functions or propagate constants across module
boundaries. To avoid generating too much code in the case of function
inlining, the compiler is conservative when choosing candidates for
cross-module inlining; see Function-Call Optimizations for information
on providing inlining hints to the compiler.
The later section `letrec` Performance provides some additional caveats
concerning inlining of module bindings.
## 4. Function-Call Optimizations
When the compiler detects a function call to an immediately visible
function, it generates more efficient code than for a generic call,
especially for tail calls. For example, given the program
```racket
(letrec ([odd (lambda (x)
                (if (zero? x)
                    #f
                    (even (sub1 x))))]
         [even (lambda (x)
                 (if (zero? x)
                     #t
                     (odd (sub1 x))))])
  (odd 40000000))
```
the compiler can detect the `odd`–`even` loop and produce code that runs
much faster via loop unrolling and related optimizations.
Within a module form, `define`d variables are lexically scoped like
`letrec` bindings, and definitions within a module therefore permit call
optimizations, so
```racket
(define (odd x) ....)
(define (even x) ....)
```
within a module would perform the same as the `letrec` version.
For direct calls to functions with keyword arguments, the compiler can
typically check keyword arguments statically and generate a direct call
to a non-keyword variant of the function, which reduces the run-time
overhead of keyword checking. This optimization applies only for
keyword-accepting procedures that are bound with `define`.
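As an illustration, in a sketch like the following (the function name
`scale` is hypothetical), the direct call supplies its keyword literally,
so the keyword can be checked statically:
```racket
;; A keyword-accepting function bound with `define`; a direct call with a
;; literal keyword can bypass run-time keyword dispatch.
(define (scale v #:by [factor 2])
  (* v factor))

(scale 10 #:by 3)  ; direct call; keyword resolved at compile time
```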
For immediate calls to functions that are small enough, the compiler may
inline the function call by replacing the call with the body of the
function. In addition to the size of the target function's body, the
compiler's heuristics take into account the amount of inlining already
performed at the call site and whether the called function itself calls
functions other than simple primitive operations. When a module is
compiled, some functions defined at the module level are determined to
be candidates for inlining into other modules; normally, only trivial
functions are considered candidates for cross-module inlining, but a
programmer can wrap a function definition with `begin-encourage-inline`
to encourage inlining of the function.
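A sketch of such a hint, using `begin-encourage-inline` from
`racket/performance-hint`; the helper `norm2` is hypothetical:
```racket
#lang racket/base
(require racket/performance-hint)

;; A small helper wrapped so that the compiler treats it as a candidate
;; for inlining, including across module boundaries.
(begin-encourage-inline
  (define (norm2 x y)
    (+ (* x x) (* y y))))

(provide norm2)
```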
Primitive operations like `pair?`, `car`, and `cdr` are inlined at the
machine-code level by the JIT compiler. See also the later section
Fixnum and Flonum Optimizations for information about inlined arithmetic
operations.
## 5. Mutation and Performance
Using `set!` to mutate a variable can lead to bad performance. For
example, the microbenchmark
```racket
#lang racket/base

(define (subtract-one x)
  (set! x (sub1 x))
  x)

(time
 (let loop ([n 4000000])
   (if (zero? n)
       'done
       (loop (subtract-one n)))))
```
runs much more slowly than the equivalent
```racket
#lang racket/base

(define (subtract-one x)
  (sub1 x))

(time
 (let loop ([n 4000000])
   (if (zero? n)
       'done
       (loop (subtract-one n)))))
```
In the first variant, a new location is allocated for `x` on every
iteration, leading to poor performance. A more clever compiler could
unravel the use of `set!` in the first example, but since mutation is
discouraged (see [missing]), the compiler's effort is spent elsewhere.
More significantly, mutation can obscure bindings where inlining and
constant-propagation might otherwise apply. For example, in
```racket
(let ([minus1 #f])
  (set! minus1 sub1)
  (let loop ([n 4000000])
    (if (zero? n)
        'done
        (loop (minus1 n)))))
```
the `set!` obscures the fact that `minus1` is just another name for the
built-in `sub1`.
## 6. `letrec` Performance
When `letrec` is used to bind only procedures and literals, then the
compiler can treat the bindings in an optimal manner, compiling uses of
the bindings efficiently. When other kinds of bindings are mixed with
procedures, the compiler may be less able to determine the control flow.
For example,
```racket
(letrec ([loop (lambda (x)
                 (if (zero? x)
                     'done
                     (loop (next x))))]
         [junk (display loop)]
         [next (lambda (x) (sub1 x))])
  (loop 40000000))
```
likely compiles to less efficient code than
```racket
(letrec ([loop (lambda (x)
                 (if (zero? x)
                     'done
                     (loop (next x))))]
         [next (lambda (x) (sub1 x))])
  (loop 40000000))
```
In the first case, the compiler likely does not know that `display` does
not call `loop`. If it did, then `loop` might refer to `next` before the
binding is available.
This caveat about `letrec` also applies to definitions of functions and
constants as internal definitions or in modules. A definition sequence
in a module body is analogous to a sequence of `letrec` bindings, and
non-constant expressions in a module body can interfere with the
optimization of references to later bindings.
## 7. Fixnum and Flonum Optimizations
A _fixnum_ is a small exact integer. In this case, “small” depends on
the platform. For a 32-bit machine, numbers that can be expressed in 30
bits plus a sign bit are represented as fixnums. On a 64-bit machine, 62
bits plus a sign bit are available.
A _flonum_ is used to represent an inexact real number. Flonums
correspond to 64-bit IEEE floating-point numbers on all platforms.
Inlined fixnum and flonum arithmetic operations are among the most
important advantages of the JIT compiler. For example, when `+` is
applied to two arguments, the generated machine code tests whether the
two arguments are fixnums, and if so, it uses the machine's instruction
to add the numbers (and check for overflow). If the two numbers are
not fixnums, then it checks whether both are flonums; in that case, the
machine's floating-point operations are used directly. For functions
that take any number of arguments, such as `+`, inlining works for two
or more arguments \(except for `-`, whose one-argument case is also
inlined\) when the arguments are either all fixnums or all flonums.
Flonums are typically _boxed_, which means that memory is allocated to
hold every result of a flonum computation. Fortunately, the generational
garbage collector (described later in Memory Management) makes
allocation for short-lived results reasonably cheap. Fixnums, in
contrast, are never boxed, so they are typically cheap to use.
> See \[missing\] for an example use of flonum-specific operations.
The `racket/flonum` library provides flonum-specific operations, and
combinations of flonum operations allow the JIT compiler to generate
code that avoids boxing and unboxing intermediate results. Besides
results within immediate combinations, flonum-specific results that are
bound with `let` and consumed by a later flonum-specific operation are
unboxed within temporary storage. Unboxing applies most reliably to uses
of a flonum-specific operation with two arguments. Finally, the compiler
can detect some flonum-valued loop accumulators and avoid boxing of the
accumulator. The bytecode decompiler \(see \[missing\]\) annotates
combinations where the JIT can avoid boxes with `#%flonum`,
`#%as-flonum`, and `#%from-flonum`.
> Unboxing of local bindings and accumulators is not supported by the JIT
> for PowerPC.
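As a rough sketch of the kind of combination that can remain unboxed (the
function name `fl-hypot` is hypothetical):
```racket
#lang racket/base
(require racket/flonum)

;; The intermediate fl* and fl+ results feed directly into other
;; flonum-specific operations, so they need not be boxed.
(define (fl-hypot a b)
  (flsqrt (fl+ (fl* a a) (fl* b b))))
```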
The `racket/unsafe/ops` library provides unchecked fixnum- and
flonum-specific operations. Unchecked flonum-specific operations allow
unboxing, and sometimes they allow the compiler to reorder expressions
to improve performance. See also Unchecked, Unsafe Operations,
especially the warnings about unsafety.
## 8. Unchecked, Unsafe Operations
The `racket/unsafe/ops` library provides functions that are like other
functions in `racket/base`, but they assume \(instead of checking\) that
provided arguments are of the right type. For example,
`unsafe-vector-ref` accesses an element from a vector without checking
that its first argument is actually a vector and without checking that
the given index is in bounds. For tight loops that use these functions,
avoiding checks can sometimes speed the computation, though the benefits
vary for different unchecked functions and different contexts.
Beware that, as “unsafe” in the library and function names suggest,
misusing the exports of `racket/unsafe/ops` can lead to crashes or
memory corruption.
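For illustration only, here is a sketch of a tight loop over a vector
using unchecked operations; it is reasonable only because the loop itself
keeps the index in bounds and the caller guarantees that `v` is a vector
of fixnums:
```racket
#lang racket/base
(require racket/unsafe/ops)

;; Sum a vector of fixnums; bounds and type checks are skipped, so the
;; caller must guarantee the assumptions stated above.
(define (vector-sum v)
  (define len (vector-length v))
  (let loop ([i 0] [acc 0])
    (if (unsafe-fx= i len)
        acc
        (loop (unsafe-fx+ i 1)
              (unsafe-fx+ acc (unsafe-vector-ref v i))))))
```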
## 9. Foreign Pointers
The `ffi/unsafe` library provides functions for unsafely reading and
writing arbitrary pointer values. The JIT recognizes uses of `ptr-ref`
and `ptr-set!` where the second argument is a direct reference to one of
the following built-in C types: `_int8`, `_int16`, `_int32`, `_int64`,
`_double`, `_float`, and `_pointer`. If, in addition, the first argument
to `ptr-ref` or `ptr-set!` is a C pointer (not a byte string), then the
pointer read or write is performed inline in the generated code.
The bytecode compiler will optimize references to integer abbreviations
like `_int` to C types like `_int32`—where the representation sizes are
constant across platforms—so the JIT can specialize access with those C
types. C types such as `_long` or `_intptr` are not constant across
platforms, so their uses are currently not specialized by the JIT.
Pointer reads and writes using `_float` or `_double` are not currently
subject to unboxing optimizations.
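For illustration, a sketch that writes and reads a C `double` through a
raw pointer; writing the `_double` type directly at each call site is
what lets the JIT inline the access:
```racket
#lang racket/base
(require ffi/unsafe)

;; Allocate raw (non-GC) memory for one C double, write it, read it back,
;; and free it.
(define p (malloc _double 'raw))
(ptr-set! p _double 3.14)
(ptr-ref p _double)   ; => 3.14
(free p)
```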
## 10. Regular Expression Performance
When a string or byte string is provided to a function like
`regexp-match`, then the string is internally compiled into a regexp
value. Instead of supplying a string or byte string multiple times as a
pattern for matching, compile the pattern once to a regexp value using
`regexp`, `byte-regexp`, `pregexp`, or `byte-pregexp`. In place of a
constant string or byte string, write a constant regexp using an `#rx`
or `#px` prefix.
```racket
(define (slow-matcher str)
  (regexp-match? "[0-9]+" str))

(define (fast-matcher str)
  (regexp-match? #rx"[0-9]+" str))

(define (make-slow-matcher pattern-str)
  (lambda (str)
    (regexp-match? pattern-str str)))

(define (make-fast-matcher pattern-str)
  (define pattern-rx (regexp pattern-str))
  (lambda (str)
    (regexp-match? pattern-rx str)))
```
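For example, with the definitions above, both fast variants compile their
pattern only once:
```racket
(fast-matcher "call 555-0100")            ; pattern built at read time
((make-fast-matcher "[0-9]+") "room 101") ; pattern built once, when the matcher is made
```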
## 11. Memory Management
The Racket implementation is available in three variants: _3m_, _CGC_,
and _CS_. The 3m and CS variants use a modern, _generational garbage
collector_ that makes allocation relatively cheap for short-lived
objects. The CGC variant uses a _conservative garbage collector_ which
facilitates interaction with C code at the expense of both precision and
speed for Racket memory management. The 3m variant is currently the
standard one.
Although memory allocation is reasonably cheap, avoiding allocation
altogether is normally faster. One particular place where allocation can
be avoided sometimes is in _closures_, which are the run-time
representation of functions that contain free variables. For example,
```racket
(let loop ([n 40000000] [prev-thunk (lambda () #f)])
  (if (zero? n)
      (prev-thunk)
      (loop (sub1 n)
            (lambda () n))))
```
allocates a closure on every iteration, since `(lambda () n)`
effectively saves `n`.
The compiler can eliminate many closures automatically. For example, in
```racket
(let loop ([n 40000000] [prev-val #f])
  (let ([prev-thunk (lambda () n)])
    (if (zero? n)
        prev-val
        (loop (sub1 n) (prev-thunk)))))
```
no closure is ever allocated for `prev-thunk`, because its only
application is visible, and so it is inlined. Similarly, in
```racket
(let n-loop ([n 400000])
  (if (zero? n)
      'done
      (let m-loop ([m 100])
        (if (zero? m)
            (n-loop (sub1 n))
            (m-loop (sub1 m))))))
```
then the expansion of the `let` form to implement `m-loop` involves a
closure over `n`, but the compiler automatically converts the closure to
pass itself `n` as an argument instead.
## 12. Reachability and Garbage Collection
In general, Racket re-uses the storage for a value when the garbage
collector can prove that the object is unreachable from any other
(reachable) value. Reachability is a low-level, abstraction-breaking
concept (and thus one must understand many details of the runtime
system's implementation to predict precisely when values are reachable
from each other), but generally speaking one value is reachable from a
second one when there is some operation to recover the original value
from the second one.
To help programmers understand when an object is no longer reachable and
its storage can be reused, Racket provides `make-weak-box` and
`weak-box-value`, the creator and accessor for a one-record struct that
the garbage collector treats specially. An object inside a weak box does
not count as reachable, and so `weak-box-value` might return the object
inside the box, but it might also return `#f` to indicate that the
object was otherwise unreachable and garbage collected. Note that unless
a garbage collection actually occurs, the value will remain inside the
weak box, even if it is unreachable.
For example, consider this program:
```racket
#lang racket
(struct fish (weight color) #:transparent)
(define f (fish 7 'blue))
(define b (make-weak-box f))
(printf "b has ~s\n" (weak-box-value b))
(collect-garbage)
(printf "b has ~s\n" (weak-box-value b))
```
It will print `b has #(struct:fish 7 blue)` twice because the definition
of `f` still holds onto the fish. If the program were this, however:
```racket
#lang racket
(struct fish (weight color) #:transparent)
(define f (fish 7 'blue))
(define b (make-weak-box f))
(printf "b has ~s\n" (weak-box-value b))
(set! f #f)
(collect-garbage)
(printf "b has ~s\n" (weak-box-value b))
```
the second printout will be `b has #f` because no reference to the fish
exists \(other than the one in the box\).
As a first approximation, all values in Racket must be allocated and
will demonstrate behavior similar to the fish above. There are a number
of exceptions, however:
* Small integers (recognizable with `fixnum?`) are always available
without explicit allocation. From the perspective of the garbage
collector and weak boxes, their storage is never reclaimed, as the
sketch after this list illustrates. (Due to clever representation
techniques, however, their storage does not count towards the space
that Racket uses. That is, they are effectively free.)
* Procedures where the compiler can see all of their call sites may
never be allocated at all \(as discussed above\). Similar
optimizations may also eliminate the allocation for other kinds of
values.
* Interned symbols are allocated only once \(per place\). A table inside
Racket tracks this allocation so a symbol may not become garbage
because that table holds onto it.
* Reachability is only approximate with the CGC collector \(i.e., a
value may appear reachable to that collector when there is, in fact,
no way to reach it anymore\).
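For example, a sketch of the first point: a weak box holding a fixnum
never becomes empty, because the fixnum is not heap-allocated and so is
never reclaimed.
```racket
;; A weak box holding a fixnum keeps its content across collections.
(define wb (make-weak-box 5))
(collect-garbage)
(weak-box-value wb)  ; => 5, not #f
```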
## 13. Weak Boxes and Testing
One important use of weak boxes is in testing that some abstraction
properly releases storage for data it no longer needs, but there is a
gotcha that can easily cause such test cases to pass improperly.
Imagine you're designing a data structure that needs to hold onto some
value temporarily but then should clear a field or somehow break a link
to avoid referencing that value so it can be collected. Weak boxes are a
good way to test that your data structure properly clears the value.
That is, you might write a test case that builds a value, extracts some
other value from it (that you hope becomes unreachable), puts the
extracted value into a weak box, and then checks to see if the value
disappears from the box.
This code is one attempt to follow that pattern, but it has a subtle
bug:
```racket
#lang racket
;; The `fish` struct from the earlier example:
(struct fish (weight color) #:transparent)
(let* ([fishes (list (fish 8 'red)
                     (fish 7 'blue))]
       [wb (make-weak-box (list-ref fishes 0))])
  (collect-garbage)
  (printf "still there? ~s\n" (weak-box-value wb)))
```
Specifically, it will show that the weak box is empty, but not because
`fishes` no longer holds onto the value, but because `fishes` itself is
not reachable anymore!
Change the program to this one:
```racket
#lang racket
;; The `fish` struct from the earlier example:
(struct fish (weight color) #:transparent)
(let* ([fishes (list (fish 8 'red)
                     (fish 7 'blue))]
       [wb (make-weak-box (list-ref fishes 0))])
  (collect-garbage)
  (printf "still there? ~s\n" (weak-box-value wb))
  (printf "fishes is ~s\n" fishes))
```
and now we see the expected result. The difference is the last
occurrence of the variable `fishes`. That occurrence constitutes a
reference to the list, ensuring that the list is not itself garbage
collected, and thus the red fish is not either.
## 14. Reducing Garbage Collection Pauses
By default, Racket's generational garbage collector creates brief pauses
for frequent _minor collections_, which inspect only the most recently
allocated objects, and long pauses for infrequent _major collections_,
which re-inspect all memory.
For some applications, such as animations and games, long pauses due to
a major collection can interfere unacceptably with a program's
operation. To reduce major-collection pauses, the Racket garbage
collector supports _incremental garbage-collection_ mode. In incremental
mode, minor collections create longer (but still relatively short)
pauses by performing extra work toward the next major collection. If all
goes well, most of a major collection's work has been performed by minor
collections by the time that a major collection is needed, so the major
collection's pause is as short as a minor collection's pause.
Incremental mode tends to run more slowly overall, but it can provide
much more consistent real-time behavior.
If the `PLT_INCREMENTAL_GC` environment variable is set to a value that
starts with `1`, `y`, or `Y` when Racket starts, incremental mode is
permanently enabled. Since incremental mode is only useful for certain
parts of some programs, however, and since the need for incremental mode
is a property of a program rather than its environment, the preferred
way to enable incremental mode is with `(collect-garbage 'incremental)`.
Calling `(collect-garbage 'incremental)` does not perform an immediate
garbage collection, but instead requests that each minor collection
perform incremental work up to the next major collection. The request
expires with the next major collection. Make a call to `(collect-garbage
'incremental)` in any repeating task within an application that needs to
be responsive in real time. Force a full collection with
`(collect-garbage)` just before an initial `(collect-garbage
'incremental)` to initiate incremental mode from an optimal state.
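A sketch of that pattern for an animation-style loop; the frame work and
pacing here are hypothetical stand-ins:
```racket
#lang racket/base

;; Hypothetical stand-ins for per-frame work and frame pacing.
(define (render-frame!) (void))
(define (wait-for-next-frame) (sleep 1/60))

(collect-garbage)               ; start from a freshly collected state
(collect-garbage 'incremental)  ; then request incremental mode
(let loop ([frames 0])
  (render-frame!)
  (collect-garbage 'incremental) ; renew the request every iteration
  (wait-for-next-frame)
  (when (< frames 600)           ; stop after ~10 seconds for this sketch
    (loop (add1 frames))))
```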
To check whether incremental mode is in use and how it affects pause
times, enable `debug`-level logging output for the `GC` topic. For
example,
  `racket -W "debug@GC error" main.rkt`
runs `"main.rkt"` with garbage-collection logging to stderr (while
preserving `error`-level logging for all topics). Minor collections are
reported by `min` lines, incremental-mode minor collections are reported
with `mIn` lines, and major collections are reported with `MAJ` lines.