Hyphenate
A simple hyphenation engine that uses the Knuth–Liang hyphenation algorithm originally developed for TeX. This implementation is a port of Ned Batchelder’s Python version. I have added little to their work. Accordingly, I take little credit.
I originally put together this module to handle hyphenation for my web-based book Butterick’s Practical Typography (which I made with Racket & Scribble). Though support for CSS-based hyphenation in web browsers is still iffy, soft hyphens work reliably well. But putting them into the text manually is a drag. And thus a module was born.
1 Installation
raco pkg install hyphenate
2 Interface
(require hyphenate)
The REPL displays a soft hyphen as \u00AD. But in ordinary use, you’ll only see a soft hyphen when it appears at the end of a line or page as part of a hyphenated word. Otherwise it’s not displayed. In most of the examples here, I use a standard hyphen for clarity.
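A minimal sketch of basic usage, assuming the package from section 1 is installed. I'm also assuming hyphenate takes the joiner as an optional second argument, the way hyphenatef takes it after pred. The sample phrase is arbitrary, and no output is shown because the exact break points depend on the installed patterns.

```racket
#lang racket
(require hyphenate)

;; Default joiner: the soft hyphen, which the REPL prints as \u00AD.
(hyphenate "ergo polymorphism")

;; A visible joiner (assumed to be the optional second argument)
;; makes the break points easier to read.
(hyphenate "ergo polymorphism" #\-)
```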
Because the hyphenation is based on an algorithm rather than a dictionary, it makes good guesses with unusual words.
Using the #:exceptions keyword, you can pass hyphenation exceptions as a list of words with hyphenation points marked with regular hyphens ("-"). If an exception word contains no hyphens, that word will never be hyphenated.
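A short sketch of the #:exceptions keyword, assuming the joiner is passed as the optional second argument. The deliberately odd break in "polymo-rphism" is only there to make the effect of the exception visible.

```racket
#lang racket
(require hyphenate)

;; An exception containing hyphens marks the break points for that word.
(hyphenate "polymorphism" #\- #:exceptions '("polymo-rphism"))

;; An exception containing no hyphens means the word is never hyphenated,
;; so this should return the word unchanged.
(hyphenate "polymorphism" #\- #:exceptions '("polymorphism"))
```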
Knuth & Liang were sufficiently confident about their algorithm that they originally released it with only 14 exceptions: associate[s], declination, obligatory, philanthropic, present[s], project[s], reciprocity, recognizance, reformation, retribution, and table. Admirable bravado, but it’s not hard to discover others.
Overall, my impression is that the Knuth–Liang algorithm is more likely to miss legitimate hyphenation points (i.e., generate false negatives) than create erroneous hyphenation points (i.e., false positives). This is good policy. Perfect hyphenation — that is, hyphenation that represents an exact linguistic syllabification of each word — is hardly useful in typesetting contexts. Hyphenation simply seeks to mark possible line-break and page-break locations for whatever layout engine is drawing the text. The ultimate goal is to permit more even text flow. Like horseshoes and hand grenades, close is good enough. And a word wrongly hyphenated is more likely noticed by a reader than a word inefficiently hyphenated.
Some words can’t be hyphenated correctly by any algorithm, because the right hyphenation depends on meaning, not merely on spelling. For instance, hyphenate leaves the word adder unhyphenated.
This is the right result. If you used adder to mean the machine, it would be hyphenated add-er; if you meant the snake, it would be ad-der. Better to avoid hyphenation than to hyphenate incorrectly.
Don’t send raw HTML through hyphenate. It can’t distinguish HTML tags and attributes from textual content, so it will hyphenate everything, which will goof up your file.
Instead, send your textual content through hyphenate before you put it into your HTML template. Or convert your HTML to an X-expression and process it selectively (e.g., with match).
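Here is one way the X-expression route could look. This is a sketch, not part of the module: map-text is a hypothetical helper that applies a string function to text nodes only, leaving tag names and attribute values alone. To keep the block runnable without the package, string-upcase stands in for hyphenate.

```racket
#lang racket
(require xml)

;; Hypothetical helper: apply proc to every text node of an X-expression.
(define (map-text proc x)
  (define (recur child) (map-text proc child))
  (match x
    ;; Text node: this is the only place proc is applied.
    [(? string? s) (proc s)]
    ;; Element with an attribute list: (tag ((name "value") ...) child ...)
    [(list (? symbol? tag)
           (list (list (? symbol? names) (? string? vals)) ...)
           children ...)
     `(,tag ,(map list names vals) ,@(map recur children))]
    ;; Element without an attribute list: (tag child ...)
    [(list (? symbol? tag) children ...)
     (cons tag (map recur children))]
    ;; Entities, comments, and anything else pass through unchanged.
    [_ x]))

(define html "<p class=\"intro\">Hello <em>world</em></p>")

;; Parse, transform the text nodes, and serialize again.
(xexpr->string (map-text string-upcase (string->xexpr html)))
```

In real use you would pass hyphenate (or a lambda wrapping it with your joiner and exceptions) in place of string-upcase. You would probably also want to skip the contents of elements like script and style, which this sketch does not do.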
procedure
(hyphenatef text pred [joiner #:exceptions exceptions #:min-length length]) → string?
  text : string?
  pred : procedure?
  joiner : (or/c char? string?) = (integer->char #x00AD)
  exceptions : (listof string?) = empty
  length : (or/c integer? false?) = 5
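A hedged sketch of hyphenatef in use. The assumption here, which the signature alone doesn't spell out, is that pred is called with each word and that only words for which it returns true are hyphenated. The predicate uncapitalized? is a made-up helper for illustration, handy for leaving proper names alone.

```racket
#lang racket
(require hyphenate)

;; Hypothetical predicate: true for words that don't start with a capital.
(define (uncapitalized? word)
  (and (positive? (string-length word))
       (not (char-upper-case? (string-ref word 0)))))

;; Under the assumption above, capitalized words like "Brennan" are skipped.
(hyphenatef "Brennan Huff likes fancy sauce" uncapitalized? #\-)
```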
It’s possible to do fancier kinds of hyphenation restrictions that take account of context, like not hyphenating the last word of a paragraph. But hyphenatef only operates on words. So you’ll have to write some fancier code. Separate out the hyphenatable words, and then send them through good old hyphenate.
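As a rough illustration of that approach, here is a sketch that protects the last word of a paragraph. hyphenate-except-last is a hypothetical helper, not part of the module. It splits off the final word with a regular expression and sends only the rest through hyphenate, assuming the joiner is the optional second argument.

```racket
#lang racket
(require hyphenate)

;; Hypothetical helper: hyphenate everything except the paragraph's last word.
(define (hyphenate-except-last paragraph [joiner (integer->char #x00AD)])
  ;; head = everything up to and including the last whitespace before the
  ;; final word; last-word = the final word plus any trailing whitespace.
  (match (regexp-match #px"^(.*\\s)(\\S+\\s*)$" paragraph)
    [(list _ head last-word)
     (string-append (hyphenate head joiner) last-word)]
    ;; No whitespace-separated final word (e.g. a single word): leave as-is.
    [_ paragraph]))

(hyphenate-except-last "ergo hyphenation polymorphism" #\-)
```

The same split-then-hyphenate pattern should work for other context rules, such as skipping the first word of a heading.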
procedure
(unhyphenate text [joiner]) → string?
  text : string?
  joiner : (or/c char? string?) = (integer->char 173)
A side effect of using hyphenate is that soft hyphens (or whatever the joiner is) will be embedded in the output text. If you need to support copying of text, for instance in a GUI application, you’ll probably want to strip out the hyphenation before the copied text is moved to the clipboard.
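A minimal round-trip sketch, assuming the input contains no soft hyphens of its own. In a GUI application, the unhyphenate call would sit wherever text is handed to the clipboard.

```racket
#lang racket
(require hyphenate)

(define original "ergo polymorphism")
(define hyphenated (hyphenate original)) ; soft hyphens embedded

;; Stripping the soft hyphens should give back the original text.
(equal? (unhyphenate hyphenated) original)
```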
Use this function cautiously — if joiner appeared in the original input to hyphenate, the output from unhyphenate won’t be the same string.
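To make the caveat concrete, here is a sketch that uses a regular hyphen as the joiner on text that already contains one. It assumes unhyphenate removes every occurrence of the joiner, so the hyphen that belonged to the original text is lost along with the ones hyphenate added.

```racket
#lang racket
(require hyphenate)

(define original "ribbon-cutting ceremony")

;; Hyphenate with "-" as the joiner, then strip every "-" again.
(define round-trip (unhyphenate (hyphenate original #\-) #\-))

;; The hyphen in "ribbon-cutting" was stripped too, so this is #f.
(equal? round-trip original)
```

Sticking with the default soft hyphen avoids this in most text, since soft hyphens rarely appear in ordinary input.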
3 License & source code
This module is licensed under the LGPL.
Source repository at http://github.com/mbutterick/hyphenate.