diff --git a/quad/qtest/hyphenate.md b/quad/qtest/hyphenate.md new file mode 100644 index 00000000..4342612e --- /dev/null +++ b/quad/qtest/hyphenate.md @@ -0,0 +1,413 @@ +# Hyphenate + +Matthew Butterick <[mb@mbtype.com](mailto:mb@mbtype.com)> + +```racket + (require hyphenate) package: [hyphenate](https://pkgs.racket-lang.org/package/hyphenate) + (require (submod hyphenate safe)) +``` + +A simple hyphenation engine that uses the Knuth–Liang hyphenation +algorithm originally developed for TeX. I have added little to their +work. Accordingly, I take little credit. + +## 1. Installation + +At the command line: + +`raco pkg install hyphenate` + +After that, you can update the package like so: + +`raco pkg update hyphenate` + +## 2. Importing the module + +The module can be invoked two ways: fast or safe. + +Fast mode is the default, which you get by importing the module in the +usual way: `(require` `hyphenate)`. + +Safe mode enables the function contracts documented below. Use safe mode +by importing the module as `(require` `(submod` `hyphenate` `safe))`. + +## 3. Interface + +```racket +(hyphenate xexpr + [joiner + #:exceptions exceptions + #:min-length length + #:min-left-length left-length + #:min-right-length right-length + #:omit-word word-test + #:omit-string string-test + #:omit-txexpr txexpr-test]) -> xexpr/c + xexpr : xexpr/c + joiner : (or/c char? string?) = (integer->char 173) + exceptions : (listof string?) = empty + length : (or/c integer? false?) = 5 + left-length : (or/c (and/c integer? positive?) #f) = 2 + right-length : (or/c (and/c integer? positive?) #f) = 2 + word-test : (string? . -> . any/c) = (λ(x) #f) + string-test : (string? . -> . any/c) = (λ(x) #f) + txexpr-test : (txexpr? . -> . any/c) = (λ(x) #f) +``` + +Hyphenate `xexpr` by calculating hyphenation points and inserting +`joiner` at those points. By default, `joiner` is the soft hyphen +\(Unicode 00AD = decimal 173\). Words shorter than +`#:min-length` `length` will not be hyphenated. 
To hyphenate words of +any length, use `#:min-length` `#f`. + +> The REPL displays a soft hyphen as `\u00AD`. But in ordinary use, you’ll +> only see a soft hyphen when it appears at the end of a line or page as +> part of a hyphenated word. Otherwise it’s not displayed. In most of the +> examples here, I use a standard hyphen for clarity \(by adding `#\-` as +> an argument\). + +Examples: + +```racket +> (hyphenate "ergo polymorphism") +"ergo poly\u00ADmor\u00ADphism" +> (hyphenate "ergo polymorphism" #\-) +"ergo poly-mor-phism" +> (hyphenate "ergo polymorphism" #:min-length 13) +"ergo polymorphism" +> (hyphenate "ergo polymorphism" #:min-length #f) +"ergo poly\u00ADmor\u00ADphism" +``` + +The `#:min-left-length` and `#:min-right-length` keyword arguments set +the minimum distance between a potential hyphen and the left or right +ends of the word. The default is 2 characters. Larger values will reduce +hyphens, but also prevent small words from breaking. These values will +override a smaller `#:min-length` value. 
+ +Examples: + +```racket +> (hyphenate "ergo polymorphism" #\-) +"ergo poly-mor-phism" +> (hyphenate "ergo polymorphism" #\- #:min-left-length #f) +"ergo poly-mor-phism" +> (hyphenate "ergo polymorphism" #\- #:min-length 2 #:min-left-length 5) +"ergo polymor-phism" +> (hyphenate "ergo polymorphism" #\- #:min-right-length 6) +"ergo poly-morphism" +; Next words won't be hyphenated because of large #:min-left-length +> (hyphenate "ergo +polymorphism" #\- #:min-length #f #:min-left-length 15) +"ergo polymorphism" +``` + +Because the hyphenation is based on an algorithm rather than a +dictionary, it makes good guesses with unusual words: + +Examples: + +```racket +> (hyphenate "scraunched strengths" #\-) +"scraunched strengths" +> (hyphenate "RacketCon" #\-) +"Rack-et-Con" +> (hyphenate "supercalifragilisticexpialidocious" #\-) +"su-per-cal-ifrag-ilis-tic-ex-pi-ali-do-cious" +``` + +Using the `#:exceptions` keyword, you can pass hyphenation exceptions as +a list of words with hyphenation points marked with regular hyphens +\(`"-"`\). If an exception word contains no hyphens, that word will +never be hyphenated. + +Examples: + +```racket +> (hyphenate "polymorphism" #\-) +"poly-mor-phism" +> (hyphenate "polymorphism" #\- #:exceptions '("polymo-rphism")) +"polymo-rphism" +> (hyphenate "polymorphism" #\- #:exceptions '("polymorphism")) +"polymorphism" +``` + +Knuth & Liang were sufficiently confident about their algorithm that +they originally released it with only 14 exceptions: _associate\[s\], +declination, obligatory, philanthropic, present\[s\], project\[s\], +reciprocity, recognizance, reformation, retribution_, and _table_. +Admirable bravado, but it’s not hard to discover others that need +adjustment. 
+ +Examples: + +```racket +> (hyphenate "wrong: columns signage lawyers" #\-) +"wrong: columns sig-nage law-yers" +> (hyphenate "right: columns signage lawyers" #\- + #:exceptions '("col-umns" "sign-age" "law-yers")) +"right: col-umns sign-age law-yers" +``` + +The Knuth–Liang algorithm is designed to omit legitimate hyphenation +points \(i.e., generate false negatives\) more often than it creates +erroneous hyphenation points \(i.e., false positives\). This is good +policy. Perfect hyphenation — that is, hyphenation that represents an +exact linguistic syllabification of each word — is superfluous for +typesetting. Hyphenation simply seeks to mark possible line-break and +page-break locations for whatever layout engine is drawing the text. The +ultimate goal is to permit more even text flow. Like horseshoes and hand +grenades, close is good enough. And a word wrongly hyphenated is more +likely to be noticed by a reader than a word inefficiently hyphenated. + +For this reason, certain words can’t be hyphenated algorithmically, +because the correct hyphenation depends on meaning, not merely on +spelling. For instance: + +Example: + +```racket +> (hyphenate "adder") +"adder" +``` + +This is the right result. If you used _adder_ to mean the machine, it +would be hyphenated _add-er_; if you meant the snake, it would be +_ad-der_. Better to avoid hyphenation than to hyphenate incorrectly. + +You can send HTML-style X-expressions through `hyphenate`. It will +recursively hyphenate the text strings, while leaving the tags and +attributes alone, as well as non-hyphenatable material \(like character +entities and CDATA\). 
+ +Examples: + +```racket +> (hyphenate '(p "strangely" (em "formatted" (strong "snowmen"))) #\-) +'(p "strange-ly" (em "for-mat-ted" (strong "snow-men"))) +> (hyphenate '(headline [[class "headline"]] "headline") #\-) +'(headline ((class "headline")) "head-line") +> (hyphenate '(div "The (span epsilon) entity:" epsilon) #\-) +'(div "The (span ep-silon) en-ti-ty:" epsilon) +``` + +Don’t send raw HTML or XML through `hyphenate`. It can’t distinguish +tags and attributes from textual content, so everything will be +hyphenated, thus goofing up your file. But you can easily convert your +HTML or XML to an X-expression, hyphenate it, and then convert back. + +Examples: + +```racket +> (define html "Hello") +> (hyphenate html #\-) +"Hel-lo" +> (xexpr->string (hyphenate (string->xexpr html) #\-)) +"Hel-lo" +``` + +If you’re working with HTML, be careful not to include any `