# Hyphenate

Matthew Butterick <[mb@mbtype.com](mailto:mb@mbtype.com)>

```racket
(require hyphenate)               ; package: hyphenate (https://pkgs.racket-lang.org/package/hyphenate)
(require (submod hyphenate safe))
```

A simple hyphenation engine that uses the Knuth–Liang hyphenation algorithm originally developed for TeX. I have added little to their work. Accordingly, I take little credit.

## 1. Installation

At the command line:

`raco pkg install hyphenate`

After that, you can update the package like so:

`raco pkg update hyphenate`

## 2. Importing the module

The module can be invoked two ways: fast or safe.

Fast mode is the default, which you get by importing the module in the usual way: `(require hyphenate)`.

Safe mode enables the function contracts documented below. Use safe mode by importing the module as `(require (submod hyphenate safe))`.

## 3. Interface

```racket
(hyphenate  xexpr
           [joiner
            #:exceptions exceptions
            #:min-length length
            #:min-left-length left-length
            #:min-right-length right-length
            #:omit-word word-test
            #:omit-string string-test
            #:omit-txexpr txexpr-test]) -> xexpr/c
  xexpr : xexpr/c
  joiner : (or/c char? string?) = (integer->char 173)
  exceptions : (listof string?) = empty
  length : (or/c integer? false?) = 5
  left-length : (or/c (and/c integer? positive?) #f) = 2
  right-length : (or/c (and/c integer? positive?) #f) = 2
  word-test : (string? . -> . any/c) = (λ(x) #f)
  string-test : (string? . -> . any/c) = (λ(x) #f)
  txexpr-test : (txexpr? . -> . any/c) = (λ(x) #f)
```

Hyphenate `xexpr` by calculating hyphenation points and inserting `joiner` at those points. By default, `joiner` is the soft hyphen (Unicode 00AD = decimal 173). Words shorter than `#:min-length` `length` will not be hyphenated. To hyphenate words of any length, use `#:min-length #f`.

> The REPL displays a soft hyphen as `\u00AD`. But in ordinary use, you’ll only see a soft hyphen when it appears at the end of a line or page as part of a hyphenated word. Otherwise it’s not displayed.
> In most of the examples here, I use a standard hyphen for clarity (by adding `#\-` as an argument).

Examples:

```racket
> (hyphenate "ergo polymorphism")
"ergo poly\u00ADmor\u00ADphism"
> (hyphenate "ergo polymorphism" #\-)
"ergo poly-mor-phism"
> (hyphenate "ergo polymorphism" #:min-length 13)
"ergo polymorphism"
> (hyphenate "ergo polymorphism" #:min-length #f)
"ergo poly\u00ADmor\u00ADphism"
```

The `#:min-left-length` and `#:min-right-length` keyword arguments set the minimum distance between a potential hyphen and the left or right ends of the word. The default is 2 characters. Larger values will reduce hyphens, but also prevent small words from breaking. These values will override a smaller `#:min-length` value.

Examples:

```racket
> (hyphenate "ergo polymorphism" #\-)
"ergo poly-mor-phism"
> (hyphenate "ergo polymorphism" #\- #:min-left-length #f)
"ergo poly-mor-phism"
> (hyphenate "ergo polymorphism" #\- #:min-length 2 #:min-left-length 5)
"ergo polymor-phism"
> (hyphenate "ergo polymorphism" #\- #:min-right-length 6)
"ergo poly-morphism"
; Next words won't be hyphenated because of large #:min-left-length
> (hyphenate "ergo polymorphism" #\- #:min-length #f #:min-left-length 15)
"ergo polymorphism"
```

Because the hyphenation is based on an algorithm rather than a dictionary, it makes good guesses with unusual words:

Examples:

```racket
> (hyphenate "scraunched strengths" #\-)
"scraunched strengths"
> (hyphenate "RacketCon" #\-)
"Rack-et-Con"
> (hyphenate "supercalifragilisticexpialidocious" #\-)
"su-per-cal-ifrag-ilis-tic-ex-pi-ali-do-cious"
```

Using the `#:exceptions` keyword, you can pass hyphenation exceptions as a list of words with hyphenation points marked with regular hyphens (`"-"`). If an exception word contains no hyphens, that word will never be hyphenated.
Examples:

```racket
> (hyphenate "polymorphism" #\-)
"poly-mor-phism"
> (hyphenate "polymorphism" #\- #:exceptions '("polymo-rphism"))
"polymo-rphism"
> (hyphenate "polymorphism" #\- #:exceptions '("polymorphism"))
"polymorphism"
```

Knuth & Liang were sufficiently confident about their algorithm that they originally released it with only 14 exceptions: _associate\[s\], declination, obligatory, philanthropic, present\[s\], project\[s\], reciprocity, recognizance, reformation, retribution_, and _table_. Admirable bravado, but it’s not hard to discover others that need adjustment.

Examples:

```racket
> (hyphenate "wrong: columns signage lawyers" #\-)
"wrong: columns sig-nage law-yers"
> (hyphenate "right: columns signage lawyers" #\-
             #:exceptions '("col-umns" "sign-age" "law-yers"))
"right: col-umns sign-age law-yers"
```

The Knuth–Liang algorithm is designed to omit legitimate hyphenation points (i.e., generate false negatives) more often than it creates erroneous hyphenation points (i.e., false positives). This is good policy. Perfect hyphenation (that is, hyphenation that represents an exact linguistic syllabification of each word) is superfluous for typesetting. Hyphenation simply seeks to mark possible line-break and page-break locations for whatever layout engine is drawing the text. The ultimate goal is to permit more even text flow. Like horseshoes and hand grenades, close is good enough.

And a word wrongly hyphenated is more likely to be noticed by a reader than a word inefficiently hyphenated.

For this reason, certain words can’t be hyphenated algorithmically, because the correct hyphenation depends on meaning, not merely on spelling. For instance:

Example:

```racket
> (hyphenate "adder")
"adder"
```

This is the right result. If you used _adder_ to mean the machine, it would be hyphenated _add-er_; if you meant the snake, it would be _ad-der_. Better to avoid hyphenation than to hyphenate incorrectly.
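If you do know which meaning a text intends, the `#:exceptions` keyword described earlier is one way to supply the break yourself. The following is a minimal sketch, not a recommendation from the package: the name `machine-text` and the assumption that every _adder_ in the text is the machine are mine, for illustration only.

```racket
#lang racket/base
(require hyphenate)

;; Hypothetical input: assume every "adder" here means the adding machine.
(define machine-text "adder")

;; Force the break for that meaning with an exception word.
;; The exception marks the desired hyphenation point with a regular hyphen.
(define hyphenated
  (hyphenate machine-text #\- #:exceptions '("add-er")))

hyphenated
```

For the snake, the exception word would be `"ad-der"` instead. Either way, the choice is made per call, so a text that uses both meanings still needs care.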
You can send HTML-style X-expressions through `hyphenate`. It will recursively hyphenate the text strings, while leaving the tags and attributes alone, as well as non-hyphenatable material (like character entities and CDATA).

Examples:

```racket
> (hyphenate '(p "strangely" (em "formatted" (strong "snowmen"))) #\-)
'(p "strange-ly" (em "for-mat-ted" (strong "snow-men")))
> (hyphenate '(headline [[class "headline"]] "headline") #\-)
'(headline ((class "headline")) "head-line")
> (hyphenate '(div "The (span epsilon) entity:" epsilon) #\-)
'(div "The (span ep-silon) en-ti-ty:" epsilon)
```

Don’t send raw HTML or XML through `hyphenate`. It can’t distinguish tags and attributes from textual content, so everything will be hyphenated, thus goofing up your file. But you can easily convert your HTML or XML to an X-expression, hyphenate it, and then convert back.

Examples:

```racket
> (define html "Hello")
> (hyphenate html #\-)
"Hel-lo"
> (xexpr->string (hyphenate (string->xexpr html) #\-))
"Hel-lo"
```

If you’re working with HTML, be careful not to include any `