From 9695ac81d5bc2f5c80ecce79cb96fc0dfd3ac8bb Mon Sep 17 00:00:00 2001
From: Matthew Butterick
Date: Mon, 10 Feb 2014 16:31:43 -0800
Subject: [PATCH] updates

---
 hyphenate/scribblings/hyphenate.html  |  3 ++-
 hyphenate/scribblings/hyphenate.scrbl | 33 ++++++++++++++++++++-------
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/hyphenate/scribblings/hyphenate.html b/hyphenate/scribblings/hyphenate.html
index e37d43a6..762cb423 100644
--- a/hyphenate/scribblings/hyphenate.html
+++ b/hyphenate/scribblings/hyphenate.html
@@ -1,2 +1,3 @@
-Hyphenate
6.0.0.1

Hyphenate

Matthew Butterick (mb@mbtype.com)

 (require hyphenate) package: hyphenate

A simple hyphenation module that uses the Knuth–Liang hyphenation algorithm and patterns originally developed for TeX. This implementation was ported from Ned Batchelder’s Python version.

I originally developed this module to handle hyphenation for my web-based book Butterick’s Practical Typography. Even though support for CSS-based hyphenation is still iffy among web browsers, soft hyphens work reliably.

1 How to use it

2 Interface

procedure

(hyphenate text    
  [joiner    
  #:exceptions exceptions    
  #:min-length length])  string?
  text : string?
  joiner : (or/c char? string?) = (integer->char 173)
  exceptions : (listof string?) = empty
  length : (or/c integer? false?) = 5
Hyphenate text by calculating hyphenation points and inserting joiner at those points. By default, joiner is the soft hyphen. Words shorter than length will not be hyphenated. To hyphenate words of any length, use #:min-length #f.

The REPL will display a soft hyphen as #\u00AD. But in ordinary use, you only see a soft hyphen when it appears at the end of a line or page as part of a hyphenated word. Otherwise it’s invisible.

Using the #:exceptions keyword, you can pass hyphenation exceptions as a list of words with regular hyphen characters ("-") marking the permissible hyphenation points. If an exception word contains no hyphens, that word will never be hyphenated.

Examples:

> (hyphenate "polymorphism" #\-)

hyphenate: undefined;

 cannot reference undefined identifier

> (hyphenate "polymorphism" #\- #:exceptions '("polymo-rphism"))

hyphenate: undefined;

 cannot reference undefined identifier

> (hyphenate "polymorphism" #\- #:exceptions '("polymorphism"))

hyphenate: undefined;

 cannot reference undefined identifier

Knuth & Liang were sufficiently confident about their algorithm that they originally released it with only 14 exceptions: associate[s], declination, obligatory, philanthropic, present[s], project[s], reciprocity, recognizance, reformation, retribution, and table. While their bravado is admirable, it’s easy to discover words they missed.

Don’t send raw HTML through hyphenate. It can’t distinguish HTML tags and attributes from textual content, but it will hyphenate them anyhow, which will break the markup. Run your textual content through hyphenate before you put it into your page template. Or convert your HTML to an X-expression and process it selectively.

procedure

(hyphenatef text    
  pred    
  [joiner    
  #:exceptions exceptions    
  #:min-length length])  string?
  text : string?
  pred : procedure?
  joiner : (or/c char? string?) = (integer->char 173)
  exceptions : (listof string?) = empty
  length : (or/c integer? false?) = 5
Like hyphenate, but only words matching pred are hyphenated. Convenient if you want to filter out, say, capitalized words.

procedure

(unhyphenate text [joiner])  string?

  text : string?
  joiner : (or/c char? string?) = (integer->char 173)
Remove joiner from text. Essentially equivalent to (string-replace text joiner "").

A side effect of using hyphenate is that soft hyphens (or whatever the joiner is) are being embedded in the output text. If you’re building an application that needs to support, for instance, copying of text in a graphical interface, you probably want to strip out the hyphenation before the copied text is moved to the clipboard.

Keep in mind, however, that unhyphenate won’t produce the input originally passed to hyphenate if the joiner was part of the original input text.

 
\ No newline at end of file
+Hyphenate
6.0.0.1

Hyphenate

Matthew Butterick <mb@mbtype.com>

A simple hyphenation engine that uses the Knuth–Liang hyphenation algorithm originally developed for TeX. This implementation is a port of Ned Batchelder’s Python version. I have added little to their work. Accordingly, I take little credit.

I originally put together this module to handle hyphenation for my web-based book Butterick’s Practical Typography (which I made with Racket & Scribble). Though support for CSS-based hyphenation in web browsers is still iffy, soft hyphens work reliably well. But putting them into the text manually is a drag. And thus a module was born.

1 Installation

At the command line: +

raco pkg install hyphenate

2 Interface

 (require hyphenate)

procedure

(hyphenate text    
  [joiner    
  #:exceptions exceptions    
  #:min-length length])  string?
  text : string?
  joiner : (or/c char? string?) = (integer->char 173)
  exceptions : (listof string?) = empty
  length : (or/c integer? false?) = 5
Hyphenate text by calculating hyphenation points and inserting joiner at those points. By default, joiner is the soft hyphen (Unicode 00AD = decimal 173). Words shorter than #:min-length length will not be hyphenated. To hyphenate words of any length, use #:min-length #f.

The REPL displays a soft hyphen as \u00AD. But in ordinary use, you’ll only see a soft hyphen when it appears at the end of a line or page as part of a hyphenated word. Otherwise it’s not displayed. In most of the examples here, I use a standard hyphen for clarity.

Examples:

> (hyphenate "ergo polymorphic")

"ergo poly\u00ADmor\u00ADphic"

> (hyphenate "ergo polymorphic" #\-)

"ergo poly-mor-phic"

> (hyphenate "ergo polymorphic" #:min-length 13)

"ergo polymorphic"

> (hyphenate "ergo polymorphic" #:min-length #f)

"er\u00ADgo poly\u00ADmor\u00ADphic"

Because the hyphenation is based on an algorithm rather than a dictionary, it makes good guesses with unusual words:

Examples:

> (hyphenate "scraunched strengths" #\-)

"scraunched strengths"

> (hyphenate "Racketcon" #\-)

"Rack-et-con"

> (hyphenate "supercalifragilisticexpialidocious" #\-)

"su-per-cal-ifrag-ilis-tic-ex-pi-ali-do-cious"

Using the #:exceptions keyword, you can pass hyphenation exceptions as a list of words with hyphenation points marked with regular hyphens ("-"). If an exception word contains no hyphens, that word will never be hyphenated.

Examples:

> (hyphenate "polymorphic" #\-)

"poly-mor-phic"

> (hyphenate "polymorphic" #\- #:exceptions '("polymo-rphic"))

"polymo-rphic"

> (hyphenate "polymorphic" #\- #:exceptions '("polymorphic"))

"polymorphic"

Knuth & Liang were sufficiently confident about their algorithm that they originally released it with only 14 exceptions: associate[s], declination, obligatory, philanthropic, present[s], project[s], reciprocity, recognizance, reformation, retribution, and table. Admirable bravado, but it’s not hard to discover others.

Examples:

> (hyphenate "wrong: columns signage lawyers" #\-)

"wrong: columns sig-nage lawyers"

> (hyphenate "right: columns signage lawyers" #\-
  #:exceptions '("col-umns" "sign-age" "law-yers"))

"right: col-umns sign-age law-yers"

Overall, my impression is that the Knuth–Liang algorithm is more likely to miss legitimate hyphenation points (i.e., generate false negatives) than create erroneous hyphenation points (i.e., false positives). This is good policy. Perfect hyphenation — that is, hyphenation that represents an exact linguistic syllabification of each word — is hardly useful in typesetting contexts. Hyphenation simply seeks to mark possible line-break and page-break locations for whatever layout engine is drawing the text. The ultimate goal is to permit more even text flow. Like horseshoes and hand grenades, close is good enough. And a word wrongly hyphenated is more likely noticed by a reader than a word inefficiently hyphenated.

For this reason, certain words can’t be hyphenated algorithmically, because the correct hyphenation depends on meaning, not merely on spelling. For instance:

Example:

> (hyphenate "adder")

"adder"

This is the right result. If you used adder to mean the machine, it would be hyphenated add-er; if you meant the snake, it would be ad-der. Better to avoid hyphenation than to hyphenate incorrectly.

Don’t send raw HTML through hyphenate. It can’t distinguish HTML tags and attributes from textual content, so it will hyphenate everything, which will goof up your file.

Example:

> (hyphenate "<body style=\"background: yellow\">Hello world</body>")

"<body style=\"back\u00ADground: yel\u00ADlow\">Hel\u00ADlo world</body>"

Instead, send your textual content through hyphenate before you put it into your HTML template. Or convert your HTML to an X-expression and process it selectively (e.g., with match).
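
The selective processing suggested above could be sketched roughly as follows. This is only an illustration: map-text and attrs? are hypothetical helpers that are not part of the hyphenate module, and the list of skipped tags is an assumption. The sketch applies a procedure to the string leaves of an X-expression while leaving tag names and attribute values alone. string-upcase stands in for hyphenate so the block runs without the package installed.

```racket
#lang racket/base
(require racket/match)

;; Hypothetical helper: is v an X-expression attribute list,
;; i.e. a list of (symbol "string") pairs?
(define (attrs? v)
  (and (list? v)
       (andmap (λ (a)
                 (and (list? a)
                      (= (length a) 2)
                      (symbol? (car a))
                      (string? (cadr a))))
               v)))

;; Hypothetical helper: apply proc to every string leaf of the
;; X-expression x. Tag names and attribute values are never touched,
;; and the children of any tag listed in skip are left as they are.
(define (map-text proc x #:skip [skip '(script style pre code)])
  (define (walk-kids tag kids)
    (if (memq tag skip)
        kids
        (map (λ (k) (map-text proc k #:skip skip)) kids)))
  (match x
    [(? string?) (proc x)]
    ;; element with an attribute list: (tag ((name "value") ...) child ...)
    [(list (? symbol? tag) (? attrs? attrs) kids ...)
     (list* tag attrs (walk-kids tag kids))]
    ;; element without an attribute list: (tag child ...)
    [(list (? symbol? tag) kids ...)
     (cons tag (walk-kids tag kids))]
    ;; anything else (entities, numbers, etc.) passes through unchanged
    [_ x]))

;; string-upcase is a stand-in for hyphenate.
(map-text string-upcase
          '(body ((style "background: yellow")) "Hello " (em "world")))
;; => '(body ((style "background: yellow")) "HELLO " (em "WORLD"))
```

With the package installed, passing hyphenate as proc should hyphenate the text of the page while the style attribute stays intact.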

procedure

(hyphenatef text    
  pred    
  [joiner    
  #:exceptions exceptions    
  #:min-length length])  string?
  text : string?
  pred : procedure?
  joiner : (or/c char? string?) = (integer->char 173)
  exceptions : (listof string?) = empty
  length : (or/c integer? false?) = 5
Like hyphenate, but only words matching pred are hyphenated. Convenient if you want to prevent hyphenation of certain sets of words, like proper names:

Examples:

> (hyphenate "Brennan Huff likes fancy sauce" #\-)

"Bren-nan Huff likes fan-cy sauce"

> (define uncapitalized? (λ(word) (let ([letter (substring word 0 1)])
  (equal? letter (string-downcase letter)))))
> (hyphenatef "Brennan Huff likes fancy sauce" uncapitalized?
   #\-)

"Brennan Huff likes fan-cy sauce"

It’s possible to do fancier kinds of hyphenation restrictions that take account of context, like not hyphenating the last word of a paragraph. But hyphenatef only operates on words. So you’ll have to write some fancier code. Separate out the hyphenatable words, and then send them through good old hyphenate.
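
As one possible shape for that fancier code, here is a minimal sketch that leaves the last word of a paragraph alone. The name process-except-last-word is hypothetical and not part of the hyphenate module, and splitting on whitespace is a simplifying assumption about what counts as a word. string-upcase again stands in for hyphenate so the block runs on its own.

```racket
#lang racket/base

;; Hypothetical helper: apply proc (for example, hyphenate) to everything
;; in paragraph except its final word. A one-word paragraph is returned
;; unchanged.
(define (process-except-last-word proc paragraph)
  ;; Find the whitespace that precedes the final run of non-whitespace.
  (define m (regexp-match-positions #px"\\s+\\S+\\s*$" paragraph))
  (if m
      (let ([split (caar m)])
        (string-append (proc (substring paragraph 0 split))
                       (substring paragraph split)))
      paragraph))

;; string-upcase is a stand-in for hyphenate.
(process-except-last-word string-upcase "fancy sauce forever")
;; => "FANCY SAUCE forever"
```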

procedure

(unhyphenate text [joiner])  string?

  text : string?
  joiner : (or/c char? string?) = (integer->char 173)
Remove joiner from text using string-replace.

A side effect of using hyphenate is that soft hyphens (or whatever the joiner is) will be embedded in the output text. If you need to support copying of text, for instance in a GUI application, you’ll probably want to strip out the hyphenation before the copied text is moved to the clipboard.

Examples:

> (hyphenate "ribbon-cutting ceremony")

"rib\u00ADbon-cut\u00ADting cer\u00ADe\u00ADmo\u00ADny"

> (unhyphenate (hyphenate "ribbon-cutting ceremony"))

"ribbon-cutting ceremony"

Use this function cautiously — if joiner appeared in the original input to hyphenate, the output from unhyphenate won’t be the same string.

Examples:

> (hyphenate "ribbon-cutting ceremony" #\-)

"rib-bon-cut-ting cer-e-mo-ny"

> (unhyphenate (hyphenate "ribbon-cutting ceremony" #\-) #\-)

"ribboncutting ceremony"
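
The behavior above follows from plain string replacement. A minimal stand-in, assuming nothing about the module's internals beyond the string-replace description (strip-joiner is a hypothetical name), shows why a joiner that was already in the input can't be told apart from one that hyphenate inserted:

```racket
#lang racket/base
(require racket/string)

;; Hypothetical stand-in for unhyphenate: delete every occurrence of
;; joiner, which may be a character or a string.
(define (strip-joiner text [joiner (integer->char 173)])
  (string-replace text
                  (if (char? joiner) (string joiner) joiner)
                  ""))

(strip-joiner "rib\u00ADbon-cut\u00ADting")   ; => "ribbon-cutting"
(strip-joiner "rib-bon-cut-ting" #\-)         ; => "ribboncutting"
```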

3 License & source code

This module is licensed under the LGPL.

Source repository at http://github.com/mbutterick/hyphenate.

 
\ No newline at end of file
diff --git a/hyphenate/scribblings/hyphenate.scrbl b/hyphenate/scribblings/hyphenate.scrbl
index e04afc1c..074bc892 100644
--- a/hyphenate/scribblings/hyphenate.scrbl
+++ b/hyphenate/scribblings/hyphenate.scrbl
@@ -8,11 +8,18 @@
 @title{Hyphenate}

-@author{Matthew Butterick (mb@"@"mbtype.com)}
+@author[(author+email "Matthew Butterick" "mb@mbtype.com")]

-A simple hyphenation engine that uses the Knuth–Liang hyphenation algorithm originally developed for TeX. This implementation is a port of Ned Batchelder's @link["http://nedbatchelder.com/code/modules/hyphenate.html"]{Python version}. I can claim only the most inconsequential shred of authorial credit.
+A simple hyphenation engine that uses the Knuth–Liang hyphenation algorithm originally developed for TeX. This implementation is a port of Ned Batchelder's @link["http://nedbatchelder.com/code/modules/hyphenate.html"]{Python version}. I have added little to their work. Accordingly, I take little credit.

-I originally developed this module to handle hyphenation for my web-based book @link["http://practicaltypography.com"]{Butterick's Practical Typography}. Among web browsers, support for CSS-based hyphenation is still iffy, but soft hyphens work reliably well. Putting them into the text manually, however, is a drag. Hence @racketmodname[hyphenate].
+I originally put together this module to handle hyphenation for my web-based book @link["http://practicaltypography.com"]{Butterick's Practical Typography} (which I made with @tech{Racket} & @tech{Scribble}). Though support for CSS-based hyphenation in web browsers is @link["http://caniuse.com/#search=hyphen"]{still iffy}, soft hyphens work reliably well. But putting them into the text manually is a drag. And thus a module was born.
+
+@section{Installation}
+
+At the command line:
+@verbatim{raco pkg install hyphenate}
+
+@section{Interface}

 @defmodule[hyphenate]

@@ -43,7 +50,7 @@ Because the hyphenation is based on an algorithm rather than a dictionary, it ma
 (hyphenate "supercalifragilisticexpialidocious" #\-)
 ]

-Using the @racket[#:exceptions] keyword, you can pass hyphenation exceptions as a list of words with permissible hyphenation points marked with regular hyphen characters (@racket["-"]). If an exception word contains no hyphens, that word will never be hyphenated.
+Using the @racket[#:exceptions] keyword, you can pass hyphenation exceptions as a list of words with hyphenation points marked with regular hyphens (@racket["-"]). If an exception word contains no hyphens, that word will never be hyphenated.

 @examples[#:eval my-eval
 (hyphenate "polymorphic" #\-)
@@ -59,7 +66,7 @@ Knuth & Liang were sufficiently confident about their algorithm that they origin
 #:exceptions '("col-umns" "sign-age" "law-yers"))
 ]

-Overall, my impression is that the Knuth–Liang algorithm tends to miss legitimate hyphenation points (i.e., it generates false negatives) more often than it creates erroneous hyphenation points (i.e., false positives). This is good policy. Perfect hyphenation — that is, hyphenation that represents an exact linguistic syllabification of each word — is hardly useful in typesetting contexts. Hyphenation simply seeks to mark possible line-break and page-break locations for whatever text-layout engine is drawing the text. A word wrongly hyphenated is more likely noticed by a reader than a word inefficiently hyphenated.
+Overall, my impression is that the Knuth–Liang algorithm is more likely to miss legitimate hyphenation points (i.e., generate false negatives) than create erroneous hyphenation points (i.e., false positives). This is good policy. Perfect hyphenation — that is, hyphenation that represents an exact linguistic syllabification of each word — is hardly useful in typesetting contexts. Hyphenation simply seeks to mark possible line-break and page-break locations for whatever layout engine is drawing the text. The ultimate goal is to permit more even text flow. Like horseshoes and hand grenades, close is good enough. And a word wrongly hyphenated is more likely noticed by a reader than a word inefficiently hyphenated.

 For this reason, certain words can't be hyphenated algorithmically, because the correct hyphenation depends on meaning, not merely on spelling. For instance:

@@ -70,13 +77,13 @@ For this reason, certain words can't be hyphenated algorithmically, because the

 This is the right result. If you used @italic{adder} to mean the machine, it would be hyphenated @italic{add-er}; if you meant the snake, it would be @italic{ad-der}. Better to avoid hyphenation than to hyphenate incorrectly.

-Don't send raw HTML through @racket[hyphenate]. It can't distinguish HTML tags and attributes from textual content, so it will hyphenate everything, breaking your markup.
+Don't send raw HTML through @racket[hyphenate]. It can't distinguish HTML tags and attributes from textual content, so it will hyphenate everything, which will goof up your file.

 @examples[#:eval my-eval
 (hyphenate "<body style=\"background: yellow\">Hello world</body>")
 ]

-So pass your textual content through @racket[hyphenate] @italic{before} you put it into your HTML template. Or convert your HTML to an @tech{X-expression} and process it selectively (e.g., with @racket[match]).
+Instead, send your textual content through @racket[hyphenate] @italic{before} you put it into your HTML template. Or convert your HTML to an @tech{X-expression} and process it selectively (e.g., with @racket[match]).

 @defproc[
 (hyphenatef
@@ -95,6 +102,8 @@ Like @racket[hyphenate], but only words matching @racket[_pred] are hyphenated.
 (hyphenatef "Brennan Huff likes fancy sauce" uncapitalized?
 #\-)
 ]
+
+It's possible to do fancier kinds of hyphenation restrictions that take account of context, like not hyphenating the last word of a paragraph. But @racket[hyphenatef] only operates on words. So you'll have to write some fancier code. Separate out the hyphenatable words, and then send them through good old @racket[hyphenate].

 @defproc[
 (unhyphenate
@@ -110,9 +119,17 @@ A side effect of using @racket[hyphenate] is that soft hyphens (or whatever the
 (unhyphenate (hyphenate "ribbon-cutting ceremony"))
 ]

-Keep in mind that @racket[unhyphenate] won't produce the input originally passed to @racket[hyphenate] if @racket[_joiner] appeared in the original input.
+Use this function cautiously — if @racket[_joiner] appeared in the original input to @racket[hyphenate], the output from @racket[unhyphenate] won't be the same string.

 @examples[#:eval my-eval
 (hyphenate "ribbon-cutting ceremony" #\-)
 (unhyphenate (hyphenate "ribbon-cutting ceremony" #\-) #\-)
 ]
+
+
+@section{License & source code}
+
+This module is licensed under the LGPL.
+
+Source repository at @link["http://github.com/mbutterick/hyphenate"]{http://github.com/mbutterick/hyphenate}.
+