[quad] word segmentation and zero-width space #35

Open
opened 4 years ago by sorawee · 1 comments
sorawee commented 4 years ago (Migrated from github.com)

TLDR: is there a zero-width space in quad?

Some languages, such as Thai, do not mark word boundaries in writing. In particular, whitespace is **not** a word boundary; it is a sentence separator. As a result, quad breaks lines only between sentences, which is not optimal. The problem is described in more technical detail [here](http://www.cs.cmu.edu/~paisarn/papers/nlprs97.pdf).

There are existing tools that help with this problem, notably [Swath](http://manpages.ubuntu.com/manpages/bionic/man1/swath.1.html). My current solution is to traverse the document tree and replace each string with the output from Swath. However, I need a [zero-width space](https://en.wikipedia.org/wiki/Zero-width_space) to join the segmented words, so that each word boundary is a possible line break but nothing visible is added. Is there a way to input it?

More generally, is this a good approach? I understand that quad is meant to be low-level, so word segmentation might not be suitable to resolve at this level. Yet IMO it also doesn't make sense to leave such a low-level problem to users.
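The tree-traversal approach described above could look roughly like the following. This is only a sketch in Python, not Quad or Pollen code: the nested-list tree shape, `insert_breaks`, and `toy_segment` are hypothetical names made up for illustration, and `toy_segment` is a tiny dictionary matcher standing in for a real segmenter such as Swath.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def toy_segment(text, words=("ภาษา", "ไทย")):
    """Hypothetical stand-in for a real segmenter such as Swath.

    Greedy longest match against a tiny word list; any character that
    starts no known word becomes its own segment.
    """
    by_length = sorted(words, key=len, reverse=True)
    segments, i = [], 0
    while i < len(text):
        match = next((w for w in by_length if text.startswith(w, i)), text[i])
        segments.append(match)
        i += len(match)
    return segments


def insert_breaks(node, segment=toy_segment):
    """Return a copy of the tree with ZWSP between the segments of each string.

    Assumed tree shape: a string, or a list [tag, child, child, ...].
    The tag itself is never segmented.
    """
    if isinstance(node, str):
        return ZWSP.join(segment(node))
    tag, *children = node
    return [tag] + [insert_breaks(child, segment) for child in children]


doc = ["p", "ภาษาไทย", ["em", "ไทยภาษา"]]
result = insert_breaks(doc)
```

Passing the segmenter in as a function keeps the traversal independent of whichever tool does the actual word segmentation.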
mbutterick commented 4 years ago (Migrated from github.com)

IIUC the general and correct answer is to implement the Unicode line-breaking algorithm, which respects the zero-width space. (I started a version of it [over here](https://github.com/mbutterick/unicode-linebreak), but have not made progress. If someone wanted a manageable, self-contained contribution to Quad, that would be a good one.)

In the interim you can add the zero-width space to the [list of `softies` in the line wrapper](https://github.com/mbutterick/quad/blob/master/quadwriter/line.rkt#L249) and see if it does what you want. If so, you can make a PR.
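To illustrate why treating the zero-width space as a soft break would help, here is a minimal greedy wrapper in Python. It is not Quad's line wrapper and does not reproduce the `softies` list; `wrap` and `soft_breaks` are hypothetical names, and width is counted in code points rather than measured from font metrics.

```python
import re

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def wrap(text, width, soft_breaks=(" ", ZWSP)):
    """Greedy line wrap that may break only at characters in soft_breaks.

    Simplifications: width is a count of code points, a chunk longer than
    width is left overlong, and ZWSP is dropped from the output because it
    occupies no space.
    """
    splitter = re.compile("([" + "".join(map(re.escape, soft_breaks)) + "])")
    parts = splitter.split(text)  # [chunk, sep, chunk, sep, ..., chunk]
    lines, current, glue = [], "", ""
    for i in range(0, len(parts), 2):
        chunk = parts[i]
        if chunk:
            if not current:
                current = chunk
            elif len(current + glue + chunk) > width:
                lines.append(current)  # break at the soft break
                current = chunk
            else:
                current += glue + chunk
        # The separator after this chunk decides what joins the next one.
        sep = parts[i + 1] if i + 1 < len(parts) else ""
        glue = "" if sep == ZWSP else sep
    if current:
        lines.append(current)
    return lines


text = "ภาษา" + ZWSP + "ไทย" + ZWSP + "ภาษา"
without_zwsp = wrap(text, 7, soft_breaks=(" ",))
with_zwsp = wrap(text, 7)
```

With only the space as a soft break, the whole string stays on one overlong line; once U+200B is in the set, the wrapper can break between the segmented words.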
Reference: mbutterick/pollen-users#35