Hyphenate

Matthew Butterick <mb@mbtype.com>

 (require hyphenate)               package: [hyphenate](https://pkgs.racket-lang.org/package/hyphenate)
 (require (submod hyphenate safe))

A simple hyphenation engine that uses the Knuth-Liang hyphenation algorithm originally developed for TeX. I have added little to their work. Accordingly, I take little credit.

1. Installation

At the command line:

raco pkg install hyphenate

After that, you can update the package like so:

raco pkg update hyphenate

2. Importing the module

The module can be invoked two ways: fast or safe.

Fast mode is the default, which you get by importing the module in the usual way: (require hyphenate).

Safe mode enables the function contracts documented below. Use safe mode by importing the module as (require (submod hyphenate safe)).
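As a rough illustration of the difference, the sketch below imports the safe submodule and makes one well-formed call and one malformed call. The malformed call passes a symbol as joiner, which is an assumption chosen because it falls outside the documented `(or/c char? string?)` contract; the names `hyphenated` and `outcome` are only for illustration.

```racket
#lang racket/base
;; Safe mode: the same functions, with the documented contracts checked.
(require (submod hyphenate safe))

;; A well-formed call behaves as it does in fast mode.
(define hyphenated (hyphenate "polymorphism" #\-))
hyphenated ; "poly-mor-phism", per the examples in this document

;; Assumption: a joiner that is neither a char nor a string violates the
;; documented contract, so safe mode should reject the call up front.
(define outcome
  (with-handlers ([exn:fail:contract? (λ (e) 'rejected)])
    (hyphenate "polymorphism" 'dash)))
outcome
```

Fast mode skips these checks, so it suits input you already trust; safe mode is the more forgiving choice while developing.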

3. Interface

(hyphenate  xexpr                                        
           [joiner                                       
            #:exceptions exceptions                      
            #:min-length length                          
            #:min-left-length left-length                
            #:min-right-length right-length              
            #:omit-word word-test                        
            #:omit-string string-test                    
            #:omit-txexpr txexpr-test])     -> xexpr/c   
  xexpr : xexpr/c                                        
  joiner : (or/c char? string?) = (integer->char 173)    
  exceptions : (listof string?) = empty                  
  length : (or/c integer? false?) = 5                    
  left-length : (or/c (and/c integer? positive?) #f) = 2 
  right-length : (or/c (and/c integer? positive?) #f) = 2
  word-test : (string? . -> . any/c) = (λ(x) #f)         
  string-test : (string? . -> . any/c) = (λ(x) #f)       
  txexpr-test : (txexpr? . -> . any/c) = (λ(x) #f)       

Hyphenate xexpr by calculating hyphenation points and inserting joiner at those points. By default, joiner is the soft hyphen (Unicode 00AD = decimal 173). Words shorter than #:min-length length will not be hyphenated. To hyphenate words of any length, use #:min-length #f.

The REPL displays a soft hyphen as \u00AD. But in ordinary use, you’ll only see a soft hyphen when it appears at the end of a line or page as part of a hyphenated word. Otherwise it’s not displayed. In most of the examples here, I use a standard hyphen for clarity (by adding #\- as an argument).

Examples:

> (hyphenate "ergo polymorphism")                
"ergo poly\u00ADmor\u00ADphism"                  
> (hyphenate "ergo polymorphism" #\-)            
"ergo poly-mor-phism"                            
> (hyphenate "ergo polymorphism" #:min-length 13)
"ergo polymorphism"                              
> (hyphenate "ergo polymorphism" #:min-length #f)
"ergo poly\u00ADmor\u00ADphism"                  

The #:min-left-length and #:min-right-length keyword arguments set the minimum distance between a potential hyphen and the left or right ends of the word. The default is 2 characters. Larger values will reduce hyphens, but also prevent small words from breaking. These values will override a smaller #:min-length value.

Examples:

> (hyphenate "ergo polymorphism" #\-)                                   
"ergo poly-mor-phism"                                                   
> (hyphenate "ergo polymorphism" #\- #:min-left-length #f)              
"ergo poly-mor-phism"                                                   
> (hyphenate "ergo polymorphism" #\- #:min-length 2 #:min-left-length 5)
"ergo polymor-phism"                                                    
> (hyphenate "ergo polymorphism" #\- #:min-right-length 6)              
"ergo poly-morphism"                                                    
; Next words won't be hyphenated because of large #:min-left-length
> (hyphenate "ergo polymorphism" #\- #:min-length #f #:min-left-length 15)
"ergo polymorphism"                                                     

Because the hyphenation is based on an algorithm rather than a dictionary, it makes good guesses with unusual words:

Examples:

> (hyphenate "scraunched strengths" #\-)              
"scraunched strengths"                                
> (hyphenate "RacketCon" #\-)                         
"Rack-et-Con"                                         
> (hyphenate "supercalifragilisticexpialidocious" #\-)
"su-per-cal-ifrag-ilis-tic-ex-pi-ali-do-cious"        

Using the #:exceptions keyword, you can pass hyphenation exceptions as a list of words with hyphenation points marked with regular hyphens `"-"`. If an exception word contains no hyphens, that word will never be hyphenated.

Examples:

> (hyphenate "polymorphism" #\-)                                
"poly-mor-phism"                                                
> (hyphenate "polymorphism" #\- #:exceptions '("polymo-rphism"))
"polymo-rphism"                                                 
> (hyphenate "polymorphism" #\- #:exceptions '("polymorphism")) 
"polymorphism"                                                  

Knuth & Liang were sufficiently confident about their algorithm that they originally released it with only 14 exceptions: associate[s], declination, obligatory, philanthropic, present[s], project[s], reciprocity, recognizance, reformation, retribution, and table. Admirable bravado, but it’s not hard to discover others that need adjustment.
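Corrections like these tend to accumulate, so one option is to keep them in a single list and route every call through a small wrapper. This is only a sketch: `house-exceptions` and `house-hyphenate` are hypothetical names, not part of the library, and the list holds the three corrections used in this section's examples.

```racket
#lang racket/base
(require hyphenate)

;; Hypothetical project-wide list, using the corrections from this section.
(define house-exceptions '("col-umns" "sign-age" "law-yers"))

;; Hypothetical wrapper that always applies the list.
;; The default joiner mirrors the documented default (soft hyphen, 173).
(define (house-hyphenate x [joiner (integer->char 173)])
  (hyphenate x joiner #:exceptions house-exceptions))

(define corrected (house-hyphenate "right: columns signage lawyers" #\-))
corrected ; "right: col-umns sign-age law-yers", per this section's examples
```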

Examples:

> (hyphenate "wrong: columns signage lawyers" #\-) 
"wrong: columns sig-nage law-yers"                 
> (hyphenate "right: columns signage lawyers" #\-  
  #:exceptions '("col-umns" "sign-age" "law-yers"))
"right: col-umns sign-age law-yers"                

The Knuth-Liang algorithm is designed to omit legitimate hyphenation points (i.e., generate false negatives) more often than it creates erroneous hyphenation points (i.e., false positives). This is good policy. Perfect hyphenation (that is, hyphenation that represents an exact linguistic syllabification of each word) is superfluous for typesetting. Hyphenation simply seeks to mark possible line-break and page-break locations for whatever layout engine is drawing the text. The ultimate goal is to permit more even text flow. Like horseshoes and hand grenades, close is good enough. And a word wrongly hyphenated is more likely to be noticed by a reader than a word inefficiently hyphenated.

For this reason, certain words can’t be hyphenated algorithmically, because the correct hyphenation depends on meaning, not merely on spelling. For instance:

Example:

> (hyphenate "adder")
"adder"              

This is the right result. If you used “adder” to mean the machine, it would be hyphenated “add-er”; if you meant the snake, it would be “ad-der.” Better to avoid hyphenation than to hyphenate incorrectly.

You can send HTML-style X-expressions through hyphenate. It will recursively hyphenate the text strings, while leaving the tags and attributes alone, as well as non-hyphenatable material (like character entities and CDATA).

Examples:

> (hyphenate '(p "strangely" (em "formatted" (strong "snowmen"))) #\-)
'(p "strange-ly" (em "for-mat-ted" (strong "snow-men")))              
> (hyphenate '(headline [[class "headline"]] "headline") #\-)         
'(headline ((class "headline")) "head-line")                          
> (hyphenate '(div "The (span epsilon) entity:" epsilon) #\-)         
'(div "The (span ep-silon) en-ti-ty:" epsilon)                        

Don’t send raw HTML or XML through hyphenate. It can’t distinguish tags and attributes from textual content, so everything will be hyphenated, thus goofing up your file. But you can easily convert your HTML or XML to an X-expression, hyphenate it, and then convert back.
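That round trip can be wrapped in one function. The sketch below assumes `string->xexpr` and `xexpr->string` come from Racket's `xml` library, which this document does not say explicitly; `hyphenate-markup` is a hypothetical helper name.

```racket
#lang racket/base
(require hyphenate
         xml) ; assumed source of string->xexpr and xexpr->string

;; Hypothetical helper: parse the markup, hyphenate only its text,
;; then serialize the result back to a string.
(define (hyphenate-markup markup [joiner (integer->char 173)])
  (xexpr->string (hyphenate (string->xexpr markup) joiner)))

(define round-tripped
  (hyphenate-markup "<body style=\"background: yellow\">Hello</body>" #\-))
round-tripped ; attribute values stay intact; only "Hello" is hyphenated
```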

Examples:

> (define html "<body style=\"background: yellow\">Hello</body>")
> (hyphenate html #\-)                                           
"<body style=\"back-ground: yel-low\">Hel-lo</body>"             
> (xexpr->string (hyphenate (string->xexpr html) #\-))           
"<body style=\"background: yellow\">Hel-lo</body>"               

If you’re working with HTML, be careful not to include any <script> or <style> blocks, which contain non-hyphenatable data. You can protect that data by using the #:omit-txexpr keyword to specify a txexpr-test. The test will be applied to all tagged X-expressions (see `txexpr?`). When txexpr-test evaluates to true, the item will be skipped.

Examples:

> (hyphenate '(body "processing" (script "no processing")) #\-)
'(body "pro-cess-ing" (script "no pro-cess-ing"))              
> (hyphenate '(body "processing" (script "no processing")) #\- 
  #:omit-txexpr (λ(tx) (member (get-tag tx) '(script))))       
'(body "pro-cess-ing" (script "no processing"))                
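The same idea extends to <style> blocks by naming the test and listing both tags. A minimal sketch, assuming `get-tag` is provided by the `txexpr` package (the examples here use it without showing the import); `code-element?` is a hypothetical name.

```racket
#lang racket/base
(require hyphenate
         txexpr) ; assumed source of get-tag

;; Hypothetical predicate: true for elements whose text is code, not prose.
(define (code-element? tx)
  (and (memq (get-tag tx) '(script style)) #t))

(define result
  (hyphenate '(body "processing"
                    (style "body { background: yellow }")
                    (script "no processing"))
             #\-
             #:omit-txexpr code-element?))
result ; the style and script elements should come back untouched
```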

You can also use #:omit-txexpr to omit tagged X-expressions with particular attributes. This can be used to selectively suppress hyphenation at the markup level.

Examples:

> (hyphenate '(p (span "processing") (span [[klh "no"]] "processing")) #\-)
'(p (span "pro-cess-ing") (span ((klh "no")) "pro-cess-ing"))
> (hyphenate '(p (span "processing") (span [[klh "no"]] "processing")) #\-
    #:omit-txexpr (λ(tx) (and (attrs-have-key? tx 'klh)
                              (equal? (attr-ref tx 'klh) "no"))))
'(p (span "pro-cess-ing") (span ((klh "no")) "processing"))

Similarly, you can use the #:omit-word argument to avoid words that match word-test. Convenient if you want to prevent hyphenation of certain sets of words, like proper names:

Examples:

> (hyphenate "Brennan Huff likes fancy sauce" #\-)                  
"Bren-nan Huff likes fan-cy sauce"                                  
> (define capitalized? (λ(word) (let ([letter (substring word 0 1)])
  (equal? letter (string-upcase letter)))))                         
> (hyphenate "Brennan Huff likes fancy sauce" #\- #:omit-word capitalized?)
"Brennan Huff likes fan-cy sauce"                                   

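The `capitalized?` test above compares the first character with its uppercase form, so a word starting with a digit also counts as capitalized, and an empty string would raise an error in `substring`. Whether hyphenate ever passes such words to the test is not stated here, so the variant below is only a defensive sketch; `starts-uppercase?` is a hypothetical name and needs nothing beyond `racket/base`.

```racket
#lang racket/base

;; Hypothetical word-test: true only when the first character is an
;; uppercase letter. Empty strings and leading digits yield #f.
(define (starts-uppercase? word)
  (and (positive? (string-length word))
       (char-upper-case? (string-ref word 0))))

(starts-uppercase? "Brennan") ; #t
(starts-uppercase? "fancy")   ; #f
(starts-uppercase? "42nd")    ; #f
(starts-uppercase? "")        ; #f
```

It can be passed the same way as the original test: `#:omit-word starts-uppercase?`.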
Sometimes you need #:omit-word to prevent unintended consequences. For instance, if you’re using ligatures in CSS, certain groups of characters (fi, fl, ffi, et al.) will be replaced by a single glyph. That looks snazzy, but adding soft hyphens between any of these pairs will defeat the ligature substitution, creating inconsistent results. With #:omit-word, you can skip these words.

“Wouldn’t it be better to exclude certain pairs of letters rather than whole words?” Yes. But for now, that’s not supported.
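A word-test for this purpose only has to report whether a word contains one of the ligature groups. A minimal sketch using a single regular expression; `has-ligature-pair?` is a hypothetical name, not part of the library. Because “ffi” and “ffl” both contain “ff”, three alternatives are enough.

```racket
#lang racket/base

;; Hypothetical word-test: true when the word contains ff, fi, or fl.
;; Longer groups such as ffi and ffl are already covered by "ff".
(define (has-ligature-pair? word)
  (regexp-match? #rx"ff|fi|fl" word))

(has-ligature-pair? "Hufflepuff") ; #t
(has-ligature-pair? "golfing")    ; #t
(has-ligature-pair? "Tuesday")    ; #f
```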

Examples:

> (hyphenate "Hufflepuff golfing final on Tuesday" #\-)
"Huf-flepuff golf-ing fi-nal on Tues-day"              
> (define (ligs? word)                                 
    (ormap (λ(lig) (regexp-match lig word))            
    '("ff" "fi" "fl" "ffi" "ffl")))                    
> (hyphenate "Hufflepuff golfing final on Tuesday" #\- #:omit-word ligs?)
"Hufflepuff golfing final on Tues-day"

(unhyphenate  xexpr
             [joiner                                 
              #:omit-word word-test                  
              #:omit-string string-test              
              #:omit-txexpr txexpr-test]) -> xexpr/c 
  xexpr : xexpr/c                                    
  joiner : (or/c char? string?) = (integer->char 173)
  word-test : (string? . -> . any/c) = (λ(x) #f)     
  string-test : (string? . -> . any/c) = (λ(x) #f)   
  txexpr-test : (txexpr? . -> . any/c) = (λ(x) #f)   

Remove joiner from xexpr. Like hyphenate, it works on nested X-expressions, and offers the same #:omit- options.

Examples:

> (hyphenate '(p "strangely" (em "formatted" (strong "snowmen"))) #\-)    
'(p "strange-ly" (em "for-mat-ted" (strong "snow-men")))                  
> (unhyphenate '(p "strange-ly" (em "for-mat-ted" (strong "snow-men"))) #\-)
'(p "strangely" (em "formatted" (strong "snowmen")))                      

A side effect of using hyphenate is that soft hyphens (or whatever the joiner is) will be embedded in the output text. If you need to support copying of text, for instance in a GUI application, you’ll probably want to strip out the hyphenation before the copied text is moved to the clipboard.

Examples:

> (hyphenate "ribbon-cutting ceremony")                
"rib\u00ADbon-cut\u00ADting cer\u00ADe\u00ADmo\u00ADny"
> (unhyphenate (hyphenate "ribbon-cutting ceremony"))  
"ribbon-cutting ceremony"                              

Use this function cautiously: if joiner appeared in the original input to hyphenate, the output from unhyphenate won’t be the same string.

Examples:

> (hyphenate "ribbon-cutting ceremony" #\-)                  
"rib-bon-cut-ting cer-e-mo-ny"                               
> (unhyphenate (hyphenate "ribbon-cutting ceremony" #\-) #\-)
"ribboncutting ceremony"                                     

Keep in mind that soft hyphens could appear in your input string. Certain word processors allow users to insert soft hyphens in their text.

Examples:

> (hyphenate "True\u00ADType typefaces")                  
"True\u00ADType type\u00ADfaces"                          
> (unhyphenate (hyphenate "True\u00ADType typefaces"))    
"TrueType typefaces"                                      
> (hyphenate (unhyphenate "True\u00ADType typefaces") #\-)
"True-Type type-faces"                                    

4. French

 (require hyphenate/fr)               package: [hyphenate](https://pkgs.racket-lang.org/package/hyphenate)
 (require (submod hyphenate/fr safe))

French hyphenation is available by importing the module as hyphenate/fr or (submod hyphenate/fr safe) and using the hyphenate function normally. Below, notice that the word “formidable” hyphenates differently in French.

Examples:

> (hyphenate "formidable" #\-)   
"for-mi-da-ble"                  
> (module fr racket/base         
    (require hyphenate/fr)       
    (hyphenate "formidable" #\-))
> (require 'fr)                  
"for-mi-dable"                   

The two languages are in separate submodules for performance reasons. That way, they can maintain separate caches of hyphenated words.

There is no way to use hyphenate in “polyglot” mode, where English and French are detected automatically. It is possible, however, to mix both the English and French hyphenate functions in a single file, and apply them as needed. To avoid a name conflict between the two hyphenate functions, you’ll need to use prefix-in:

Examples:

> (require (prefix-in fr: hyphenate/fr))
> (hyphenate "formidable" #\-)          
"for-mi-da-ble"                         
> (fr:hyphenate "formidable" #\-)       
"for-mi-dable"                          

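Since the language is not detected automatically, the caller has to say which one applies. One way to do that is a small dispatcher keyed on a language symbol. This is a sketch only: `hyphenate-in` and the `en:` prefix are hypothetical choices, and anything other than `'fr` falls back to English.

```racket
#lang racket/base
(require (prefix-in en: hyphenate)
         (prefix-in fr: hyphenate/fr))

;; Hypothetical helper: choose a hyphenator from a language symbol.
(define (hyphenate-in lang x [joiner (integer->char 173)])
  (case lang
    [(fr) (fr:hyphenate x joiner)]
    [else (en:hyphenate x joiner)]))

(define english (hyphenate-in 'en "formidable" #\-))
(define french (hyphenate-in 'fr "formidable" #\-))
english ; "for-mi-da-ble", per the examples above
french  ; "for-mi-dable", per the examples above
```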
5. Russian

 (require hyphenate/ru)               package: [hyphenate](https://pkgs.racket-lang.org/package/hyphenate)
 (require (submod hyphenate/ru safe))

Russian hyphenation is available by importing the module as hyphenate/ru or (submod hyphenate/ru safe) and using the hyphenate function normally. (Hat tip to Natanael de Kross for finding the patterns, originally created by Alexander I. Lebedev.)
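Usage mirrors the French module. In the sketch below the word (“programming”) is an arbitrary illustration, and no result is shown because the break points depend on the Russian pattern set.

```racket
#lang racket/base
(require hyphenate/ru)

;; Arbitrary sample word; a visible hyphen is used as the joiner so the
;; break points are easy to inspect at the REPL.
(define sample "программирование")
(define hyphenated (hyphenate sample #\-))
hyphenated
```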

6. License & source code

This module is licensed under the LGPL.

Source repository at http://github.com/mbutterick/hyphenate. Suggestions & corrections welcome.