11.2 Decode

package: pollen

The doc export of a Pollen markup file is a simple X-expression. Decoding refers to any post-processing of this X-expression. The pollen/decode module provides tools for creating decoders.

The decode step can happen separately from the compilation of the file. But you can also attach a decoder to the markup file’s root node, so the decoding happens automatically when the markup is compiled, and is thus automatically incorporated into doc. (Following this approach, you could also attach multiple decoders to different tags within doc.)

You can, of course, embed function calls within Pollen markup. But since markup is optimized for authors, decoding is useful for operations that can or should be moved out of the authoring layer.

One example is presentation and layout. For instance, detect-paragraphs is a decoder function that lets authors mark paragraphs in their source simply by using two consecutive newlines.

Another example is conversion of output into a particular data format. Most Pollen functions are optimized for HTML output, but one could write a decoder that targets another format.
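As a rough illustration of the second case, the sketch below handles one small step of a decoder aimed at LaTeX: escaping a few characters that LaTeX treats specially. It passes a string-proc to decode, which is documented below. The helper name latex-escape and the choice of characters are assumptions made for this example and are not part of Pollen.

```racket
#lang racket/base
(require pollen/decode racket/string)

;; Hypothetical helper (not part of Pollen): put a backslash in front of
;; a few characters that LaTeX treats specially.
(define (latex-escape str)
  (for/fold ([s str]) ([ch (in-list '("&" "%" "#"))])
    (string-replace s ch (string-append "\\" ch))))

;; Every string in the X-expression, at any depth, passes through latex-escape.
(define latex-doc
  (decode '(root "Profit & loss: up 50%" (em "Item #1"))
          #:string-proc latex-escape))
```

After decoding, each &, % and # in the strings of latex-doc carries a leading backslash, while the tags themselves are untouched. A real LaTeX decoder would also need to translate the tags, which this sketch leaves out.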

procedure
(decode tagged-xexpr
        [#:txexpr-tag-proc txexpr-tag-proc
         #:txexpr-attrs-proc txexpr-attrs-proc
         #:txexpr-elements-proc txexpr-elements-proc
         #:block-txexpr-proc block-txexpr-proc
         #:inline-txexpr-proc inline-txexpr-proc
         #:string-proc string-proc
         #:symbol-proc symbol-proc
         #:valid-char-proc valid-char-proc
         #:cdata-proc cdata-proc
         #:exclude-tags tags-to-exclude]) → txexpr?
  tagged-xexpr : txexpr?
  txexpr-tag-proc : (txexpr-tag? . -> . txexpr-tag?) = (λ(tag) tag)
  txexpr-attrs-proc : (txexpr-attrs? . -> . txexpr-attrs?) = (λ(attrs) attrs)
  txexpr-elements-proc : (txexpr-elements? . -> . txexpr-elements?) = (λ(elements) elements)
  block-txexpr-proc : (block-txexpr? . -> . xexpr?) = (λ(tx) tx)
  inline-txexpr-proc : (txexpr? . -> . xexpr?) = (λ(tx) tx)
  string-proc : (string? . -> . xexpr?) = (λ(str) str)
  symbol-proc : (symbol? . -> . xexpr?) = (λ(sym) sym)
  valid-char-proc : (valid-char? . -> . xexpr?) = (λ(vc) vc)
  cdata-proc : (cdata? . -> . xexpr?) = (λ(cdata) cdata)
  tags-to-exclude : (listof symbol?) = null

Recursively process a tagged-xexpr, usually the one exported from a Pollen source file as doc.

This function doesn’t do much on its own. Rather, it provides the hooks upon which harder-working functions can be hung.

Recall from [future link: Pollen mechanics] that any tag can have a function attached to it. By default, the tagged-xexpr from a source file is tagged with root. So the typical way to use decode is to attach your decoding functions to it, and then define root to invoke your decode function. Then it will be automatically applied to every doc during compile.

For instance, here’s how decode is attached to root in Butterick’s Practical Typography. There’s not much to it —

(define (root . items)
  (decode (make-txexpr 'root '() items)
          #:txexpr-elements-proc detect-paragraphs
          #:block-txexpr-proc (compose1 hyphenate wrap-hanging-quotes)
          #:string-proc (compose1 smart-quotes smart-dashes)
          #:exclude-tags '(style script)))

The hyphenate function is not part of Pollen, but rather the hyphenate package, which you can install separately.

This illustrates another important point: even though decode presents an imposing list of arguments, you’re unlikely to use all of them at once. These represent possibilities, not requirements. For instance, let’s see what happens when decode is invoked without any of its optional arguments.

Examples:

> (define tx '(root "I wonder" (em "why") "this works."))
> (decode tx)
'(root "I wonder" (em "why") "this works.")

Right — nothing. That’s because the default value for the decoding arguments is the identity function, (λ (x) x). So all the input gets passed through intact unless another action is specified.

The *-proc arguments of decode take procedures that are applied to specific categories of elements within tagged-xexpr.

The txexpr-tag-proc argument is a procedure that handles X-expression tags.

Examples:

> (define tx '(p "I'm from a strange" (strong "namespace")))
; Tags are symbols, so a tag-proc should return a symbol
> (decode tx #:txexpr-tag-proc (λ(t) (string->symbol (format "ns:~a" t))))
'(ns:p "I'm from a strange" (ns:strong "namespace"))

The txexpr-attrs-proc argument is a procedure that handles lists of X-expression attributes. (The txexpr module, included at no extra charge with Pollen, includes useful helper functions for dealing with these attribute lists.)

Examples:

> (define tx '(p [[id "first"]] "If I only had a brain."))
; Attrs is a list, so cons is OK for simple cases
> (decode tx #:txexpr-attrs-proc (λ(attrs) (cons '[class "PhD"] attrs)))
'(p ((class "PhD") (id "first")) "If I only had a brain.")

Note that txexpr-attrs-proc will change the attributes of every tagged X-expression, even those that don’t have attributes. This is useful, because sometimes you want to add attributes where none existed before. But be careful, because the behavior may make your processing function overinclusive.

Examples:

> (define tx '(div (p [[id "first"]] "If I only had a brain.")
  (p "Me too.")))
; This will insert the new attribute everywhere
> (decode tx #:txexpr-attrs-proc (λ(attrs) (cons '[class "PhD"] attrs)))
'(div
  ((class "PhD"))
  (p ((class "PhD") (id "first")) "If I only had a brain.")
  (p ((class "PhD")) "Me too."))
; This will add the new attribute only to non-null attribute lists
> (decode tx #:txexpr-attrs-proc
  (λ(attrs) (if (null? attrs) attrs (cons '[class "PhD"] attrs))))
'(div (p ((class "PhD") (id "first")) "If I only had a brain.") (p "Me too."))

The txexpr-elements-proc argument is a procedure that operates on the list of elements that represents the content of each tagged X-expression. Note that each element of an X-expression is subject to two passes through the decoder: once now, as a member of the list of elements, and also later, through its type-specific decoder (i.e., string-proc, symbol-proc, and so on).

Examples:

> (define tx '(div "Double" "\n" "toil" amp "trouble"))
; Every element gets doubled ...
> (decode tx #:txexpr-elements-proc (λ(es) (append-map (λ(e) `(,e ,e)) es)))
'(div "Double" "Double" "\n" "\n" "toil" "toil" amp amp "trouble" "trouble")
; ... but only strings get capitalized
> (decode tx #:txexpr-elements-proc (λ(es) (append-map (λ(e) `(,e ,e)) es))
  #:string-proc (λ(s) (string-upcase s)))
'(div "DOUBLE" "DOUBLE" "\n" "\n" "TOIL" "TOIL" amp amp "TROUBLE" "TROUBLE")

So why do you need txexpr-elements-proc? Because some types of element decoding depend on context, so it’s necessary to handle the elements as a group. For instance, the doubling function above, though useless, requires handling the element list as a whole, because elements are being added.

A more useful example: paragraph detection. The behavior is not merely a map across each element:

Examples:

> (define (paras tx) (decode tx #:txexpr-elements-proc detect-paragraphs))
; Context matters. Trailing whitespace is ignored ...
> (paras '(body "The first paragraph." "\n\n"))
'(body "The first paragraph.")
; ... but whitespace between strings is converted to a break.
> (paras '(body "The first paragraph." "\n\n" "And another."))
'(body (p "The first paragraph.") (p "And another."))
; A combination of both types
> (paras '(body "The first paragraph." "\n\n" "And another." "\n\n"))
'(body (p "The first paragraph.") (p "And another."))

The block-txexpr-proc and inline-txexpr-proc arguments are procedures that operate on tagged X-expressions. If the X-expression meets the block-txexpr? test, it’s processed by block-txexpr-proc. Otherwise, it’s inline, so it’s processed by inline-txexpr-proc. (Careful, however: these aren’t mutually exclusive, because block-txexpr-proc operates on all the elements of a block, including other tagged X-expressions within.)

Of course, if you want block and inline elements to be handled the same way, you can set block-txexpr-proc and inline-txexpr-proc to be the same procedure.

Examples:

> (define tx '(div "Please" (em "mind the gap") (h1 "Tuesdays only")))
> (define add-ns (λ(tx) (make-txexpr
      (string->symbol (format "ns:~a" (get-tag tx)))
      (get-attrs tx)
      (get-elements tx))))
; div and h1 are block elements, so this will only affect them
> (decode tx #:block-txexpr-proc add-ns)
'(ns:div "Please" (em "mind the gap") (ns:h1 "Tuesdays only"))
; em is an inline element, so this will only affect it
> (decode tx #:inline-txexpr-proc add-ns)
'(div "Please" (ns:em "mind the gap") (h1 "Tuesdays only"))
; this will affect all elements
> (decode tx #:block-txexpr-proc add-ns #:inline-txexpr-proc add-ns)
'(ns:div "Please" (ns:em "mind the gap") (ns:h1 "Tuesdays only"))

The string-proc, symbol-proc, valid-char-proc, and cdata-proc arguments are procedures that operate on X-expressions that are strings, symbols, valid-chars, and CDATA, respectively. Deliberately, the output contracts for these procedures accept any kind of X-expression (meaning, the procedure can change the X-expression type).

Examples:

; A div with string, entity, character, and cdata elements
> (define tx `(div "Moe" amp 62 ,(cdata #f #f "3 > 2;")))
> (define rulify (λ(x) '(hr)))
; The rulify function is selectively applied to each
> (print (decode tx #:string-proc rulify))
'(div (hr) amp 62 #(struct:cdata #f #f "3 > 2;"))
> (print (decode tx #:symbol-proc rulify))
'(div "Moe" (hr) 62 #(struct:cdata #f #f "3 > 2;"))
> (print (decode tx #:valid-char-proc rulify))
'(div "Moe" amp (hr) #(struct:cdata #f #f "3 > 2;"))
> (print (decode tx #:cdata-proc rulify))
'(div "Moe" amp 62 (hr))

Finally, the tags-to-exclude argument is a list of tags that will be exempted from decoding. Though you could get the same result by testing the input within the individual decoding functions, that’s tedious and potentially slower.

Examples:

> (define tx '(p "I really think" (em "italics") "should be lowercase."))
> (decode tx #:string-proc (λ(s) (string-upcase s)))
'(p "I REALLY THINK" (em "ITALICS") "SHOULD BE LOWERCASE.")
> (decode tx #:string-proc (λ(s) (string-upcase s)) #:exclude-tags '(em))
'(p "I REALLY THINK" (em "italics") "SHOULD BE LOWERCASE.")

The tags-to-exclude argument is useful if you’re decoding source that’s destined to become HTML. According to the HTML spec, material within a <style> or <script> block needs to be preserved literally. In this example, if the CSS and JavaScript blocks are capitalized, they won’t work. So exclude '(style script), and problem solved.

Examples:

> (define tx '(body (h1 [[class "Red"]] "Let's visit Planet Telex.")
  (style [[type "text/css"]] ".Red {color: green;}")
  (script [[type "text/javascript"]] "var area = h * w;")))
> (decode tx #:string-proc (λ(s) (string-upcase s)))
'(body
  (h1 ((class "Red")) "LET'S VISIT PLANET TELEX.")
  (style ((type "text/css")) ".RED {COLOR: GREEN;}")
  (script ((type "text/javascript")) "VAR AREA = H * W;"))
> (decode tx #:string-proc (λ(s) (string-upcase s))
  #:exclude-tags '(style script))
'(body
  (h1 ((class "Red")) "LET'S VISIT PLANET TELEX.")
  (style ((type "text/css")) ".Red {color: green;}")
  (script ((type "text/javascript")) "var area = h * w;"))

procedure
(decode-elements elements
                 [#:txexpr-tag-proc txexpr-tag-proc
                  #:txexpr-attrs-proc txexpr-attrs-proc
                  #:txexpr-elements-proc txexpr-elements-proc
                  #:block-txexpr-proc block-txexpr-proc
                  #:inline-txexpr-proc inline-txexpr-proc
                  #:string-proc string-proc
                  #:symbol-proc symbol-proc
                  #:valid-char-proc valid-char-proc
                  #:cdata-proc cdata-proc
                  #:exclude-tags tags-to-exclude]) → txexpr-elements?
  elements : txexpr-elements?
  txexpr-tag-proc : (txexpr-tag? . -> . txexpr-tag?) = (λ(tag) tag)
  txexpr-attrs-proc : (txexpr-attrs? . -> . txexpr-attrs?) = (λ(attrs) attrs)
  txexpr-elements-proc : (txexpr-elements? . -> . txexpr-elements?) = (λ(elements) elements)
  block-txexpr-proc : (block-txexpr? . -> . xexpr?) = (λ(tx) tx)
  inline-txexpr-proc : (txexpr? . -> . xexpr?) = (λ(tx) tx)
  string-proc : (string? . -> . xexpr?) = (λ(str) str)
  symbol-proc : (symbol? . -> . xexpr?) = (λ(sym) sym)
  valid-char-proc : (valid-char? . -> . xexpr?) = (λ(vc) vc)
  cdata-proc : (cdata? . -> . xexpr?) = (λ(cdata) cdata)
  tags-to-exclude : (listof symbol?) = null

Identical to decode, but takes txexpr-elements? as input rather than a whole tagged X-expression, and likewise returns txexpr-elements? rather than a tagged X-expression. A convenience variant for use inside tag functions.
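A minimal sketch of that use, assuming a hypothetical tag function named shout (not part of Pollen). A tag function receives its contents as a list of elements, so decode-elements can process them directly, without first wrapping them in a temporary tag:

```racket
#lang racket/base
(require pollen/decode)

;; Hypothetical tag function: upcases every string it receives,
;; including strings nested inside child elements.
(define (shout . elements)
  `(strong ,@(decode-elements elements #:string-proc string-upcase)))

;; Called the way Pollen would call a tag function: with the contents as arguments.
(define result (shout "hello " '(em "world")))
```

Here result should be '(strong "HELLO " (em "WORLD")): the strings are upcased at every depth, and the em tag passes through unchanged because every other decoding argument keeps its identity default.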

11.2.1 Block

Because it’s convenient, Pollen puts tagged X-expressions into two categories: block and inline. Why is it convenient? When using decode, you often want to treat the two categories differently. Not that you have to. But this is how you can.

parameter
(project-block-tags) → (listof txexpr-tag?)
(project-block-tags block-tags) → void?
block-tags : (listof txexpr-tag?)

A parameter that defines the set of tags that decode will treat as blocks. This parameter is initialized with the HTML block tags, namely:

(address article aside audio blockquote body canvas dd div dl fieldset figcaption figure footer form h1 h2 h3 h4 h5 h6 header hgroup noscript ol output p pre section table tfoot ul video)
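Because project-block-tags is a parameter, its value can presumably also be overridden for a limited scope with Racket’s parameterize. The sketch below assumes that project-block-tags is exported by pollen/decode, as this section implies, and that block-txexpr? (documented below) consults the parameter each time it is called. The tag bloq is a made-up example.

```racket
#lang racket/base
(require pollen/decode)

;; Inside parameterize, the made-up tag bloq counts as a block ...
(define inside
  (parameterize ([project-block-tags (cons 'bloq (project-block-tags))])
    (block-txexpr? '(bloq "Temporarily a block."))))

;; ... and outside, the original list of block tags is back in force.
(define outside
  (block-txexpr? '(bloq "Not a block here.")))
```

Under those assumptions, inside is #t and outside is #f. Unlike register-block-tag, this approach leaves the project-wide list untouched once the parameterize body finishes.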

procedure
(register-block-tag tag) → void?
tag : txexpr-tag?

Adds a tag to project-block-tags so that block-txexpr? will report it as a block, and decode will process it with block-txexpr-proc rather than inline-txexpr-proc.

Pollen tries to do the right thing without being told. But this is the rare case where you have to be explicit. If you introduce a tag into your markup that you want treated as a block, you must use this function to identify it, or you will get spooky behavior later on.

For instance, detect-paragraphs knows that block elements in the markup shouldn’t be wrapped in a p tag. So if you introduce a new block element called bloq without registering it as a block, misbehavior will follow:

Examples:

> (define (paras tx) (decode tx #:txexpr-elements-proc detect-paragraphs))
> (paras '(body "I want to be a paragraph." "\n\n" (bloq "But not me.")))
'(body (p "I want to be a paragraph.") (p (bloq "But not me.")))
; Wrong: bloq should not be wrapped

But once you register bloq as a block, order is restored:

Examples:

> (define (paras tx) (decode tx #:txexpr-elements-proc detect-paragraphs))
> (register-block-tag 'bloq)
> (paras '(body "I want to be a paragraph." "\n\n" (bloq "But not me.")))
'(body (p "I want to be a paragraph.") (bloq "But not me."))
; Right: bloq is treated as a block

If you find the idea of registering block tags unbearable, good news. The project-block-tags parameter includes the standard HTML block tags by default. So if you just want to use things like div and p and h1–h6, you’ll get the right behavior for free.

Examples:

> (define (paras tx) (decode tx #:txexpr-elements-proc detect-paragraphs))
> (paras '(body "I want to be a paragraph." "\n\n" (div "But not me.")))
'(body (p "I want to be a paragraph.") (div "But not me."))

procedure
(block-txexpr? v) → boolean?
v : any/c

Predicate that tests whether v is a tagged X-expression, and if so, whether the tag is among the project-block-tags. If not, it is treated as inline. To adjust how this test works, use register-block-tag.
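A short sketch of the predicate on three kinds of input, assuming the default project-block-tags listed above (that is, no extra tags have been registered):

```racket
#lang racket/base
(require pollen/decode)

(define checks
  (list (block-txexpr? '(div "A block."))             ; div is in the default list
        (block-txexpr? '(em "Inline."))               ; em is not in the list
        (block-txexpr? "Not a tagged X-expression."))) ; a plain string
```

With the default tags, checks should be '(#t #f #f): only the div passes, because em is not in the list and a plain string is not a tagged X-expression at all.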

11.2.2 Typography

An assortment of typography & layout functions, designed to be used with decode. These aren’t hard to write. So if you like these, use them. If not, make your own.

procedure
(whitespace? v) → boolean?
v : any/c

A predicate that returns #t for any stringlike v that’s entirely whitespace, for the empty string, and for lists and vectors made only of whitespace? members. Following the regexp-match convention, whitespace? returns #f for a nonbreaking space. If you want nonbreaking spaces to count as whitespace, use whitespace/nbsp?.

Examples:

> (whitespace? "\n\n   ")
#t
> (whitespace? (string->symbol "\n\n   "))
#t
> (whitespace? "")
#t
> (whitespace? '("" "  " "\n\n\n" " \n"))
#t
> (define nonbreaking-space (format "~a" #\u00A0))
> (whitespace? nonbreaking-space)
#f

procedure
(whitespace/nbsp? v) → boolean?
v : any/c

Like whitespace?, but also returns #t for nonbreaking spaces.

Examples:

> (whitespace/nbsp? "\n\n   ")
#t
> (whitespace/nbsp? (string->symbol "\n\n   "))
#t
> (whitespace/nbsp? "")
#t
> (whitespace/nbsp? '("" "  " "\n\n\n" " \n"))
#t
> (define nonbreaking-space (format "~a" #\u00A0))
> (whitespace/nbsp? nonbreaking-space)
#t

procedure
(smart-quotes str) → string?
str : string?

Convert straight quotes in str to curly according to American English conventions.

Examples:

> (define tricky-string
    "\"Why,\" she could've asked, \"are we in O‘ahu watching 'Mame'?\"")
> (display tricky-string)
"Why," she could've asked, "are we in O‘ahu watching 'Mame'?"
> (display (smart-quotes tricky-string))
“Why,” she could’ve asked, “are we in O‘ahu watching ‘Mame’?”

procedure
(smart-dashes str) → string?
str : string?

In str, convert three hyphens to an em dash, and two hyphens to an en dash, and remove surrounding spaces.

Examples:

> (define tricky-string "I had a few --- OK, like 6--8 --- thin mints.")
> (display tricky-string)
I had a few --- OK, like 6--8 --- thin mints.
> (display (smart-dashes tricky-string))
I had a few—OK, like 6–8—thin mints.
; Monospaced font not great for showing dashes, but you get the idea

procedure
(detect-linebreaks tagged-xexpr-elements
                   [#:separator linebreak-sep
                    #:insert linebreak]) → txexpr-elements?
  tagged-xexpr-elements : txexpr-elements?
  linebreak-sep : string? = world:linebreak-separator
  linebreak : xexpr? = '(br)

Within tagged-xexpr-elements, convert occurrences of linebreak-sep ("\n" by default) to linebreak, but only if linebreak-sep does not occur between blocks (see block-txexpr?). Why? Because block-level elements automatically display on a new line, so adding linebreak would be superfluous. In that case, linebreak-sep just disappears.

Examples:

> (detect-linebreaks '(div "Two items:" "\n" (em "Eggs") "\n" (em "Bacon")))
'(div "Two items:" (br) (em "Eggs") (br) (em "Bacon"))
> (detect-linebreaks '(div "Two items:" "\n" (div "Eggs") "\n" (div "Bacon")))
'(div "Two items:" (div "Eggs") (div "Bacon"))

procedure
(detect-paragraphs elements
                   [#:separator paragraph-sep
                    #:tag paragraph-tag
                    #:linebreak-proc linebreak-proc]) → txexpr-elements?
  elements : txexpr-elements?
  paragraph-sep : string? = world:paragraph-separator
  paragraph-tag : symbol? = 'p
  linebreak-proc : (txexpr-elements? . -> . txexpr-elements?) = detect-linebreaks

Find paragraphs within elements (as denoted by paragraph-sep) and wrap them with paragraph-tag. Also handle linebreaks using linebreak-proc, which is detect-linebreaks by default.

If an element is already a block-txexpr?, it will not be wrapped as a paragraph (because in that case, the wrapping would be superfluous). As a consequence, if paragraph-sep occurs between two blocks, it will be ignored (as in the example below using two sequential div blocks).

The paragraph-tag argument sets the tag used to wrap paragraphs.

The linebreak-proc argument lets you substitute a linebreaking procedure other than the usual detect-linebreaks.

Examples:

> (detect-paragraphs '("First para" "\n\n" "Second para"))
'((p "First para") (p "Second para"))
> (detect-paragraphs '("First para" "\n\n" "Second para" "\n" "Second line"))
'((p "First para") (p "Second para" (br) "Second line"))
> (detect-paragraphs '("First para" "\n\n" (div "Second block")))
'((p "First para") (div "Second block"))
> (detect-paragraphs '((div "First block") "\n\n" (div "Second block")))
'((div "First block") (div "Second block"))
> (detect-paragraphs '("First para" "\n\n" "Second para") #:tag 'ns:p)
'((ns:p "First para") (ns:p "Second para"))
> (detect-paragraphs '("First para" "\n\n" "Second para" "\n" "Second line")
  #:linebreak-proc (λ(x) (detect-linebreaks x #:insert '(newline))))
'((p "First para") (p "Second para" (newline) "Second line"))

procedure
(wrap-hanging-quotes tx
                     [#:single-preprend single-preprender
                      #:double-preprend double-preprender]) → txexpr?
  tx : txexpr?
  single-preprender : txexpr-tag? = 'squo
  double-preprender : txexpr-tag? = 'dquo

Find single or double quote marks at the beginning of tx and wrap them in an X-expression with the tag single-preprender or double-preprender, respectively. The default values are 'squo and 'dquo.

Examples:

> (wrap-hanging-quotes '(p "No quote to hang."))
'(p "No quote to hang.")
> (wrap-hanging-quotes '(p "“What? We need to hang quotes?”"))
'(p (dquo "“" "What? We need to hang quotes?”"))

In pro typography, quotation marks at the beginning of a line or paragraph are often shifted into the margin slightly to make them appear more optically aligned with the left edge of the text. With a reflowable layout model like HTML, you don’t know where your line breaks will be, so this function only handles quotation marks at the beginning of tx.

This function will simply insert the 'squo and 'dquo tags, which provide hooks that let you do the actual hanging via CSS, like so (actual measurement can be refined to taste):

squo {margin-left: -0.25em;}

dquo {margin-left: -0.50em;}

Be warned: there are many edge cases this function does not handle well.

Examples:

; Argh: this edge case is not handled properly
> (wrap-hanging-quotes '(p "“" (em "What?") "We need to hang quotes?”"))
'(p "“" (em "What?") "We need to hang quotes?”")
