[quad] word segmentation and zero-width space
#35
Open
opened 4 years ago by sorawee
·
1 comment
TLDR: is there zero-width space in quad?
In some non-English languages, such as Thai, words are not separated by spaces. In particular, whitespace is not a word boundary but a sentence separator. This causes quad to break lines only at sentence boundaries, which is not optimal. The problem is described in more technical detail here.
There are existing tools that help with this problem, notably Swath. My current solution is to traverse the document tree and replace each string with the output of Swath. However, I need a zero-width space to join the segmented words back together. Is there a way to input it?
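To make the idea concrete, here is a minimal sketch in Python (not quad's Racket) of the "join segmented words with a zero-width space" step. It assumes the segmenter's output has already been captured as a string with words separated by a delimiter such as `|`; the function names and the delimiter are assumptions for illustration, not part of Swath's or quad's API.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def split_segmented(segmented, delimiter="|"):
    """Split segmenter output into words, dropping empty pieces.

    The delimiter is an assumption; use whatever your segmenter emits.
    """
    return [word for word in segmented.split(delimiter) if word]


def join_with_zwsp(words):
    """Join words with U+200B so a line breaker can break between them."""
    return ZWSP.join(words)


def mark_word_boundaries(segmented, delimiter="|"):
    """Turn delimiter-separated segmenter output into ZWSP-separated text."""
    return join_with_zwsp(split_segmented(segmented, delimiter))
```

The result looks identical to the original string when rendered, but carries an invisible break opportunity at every word boundary.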
More generally, I want to ask if this is a good approach. I understand that quad is meant to be low-level, so the word segmentation problem might not be suitable to resolve at this level. Yet IMO, it also doesn't make sense to leave the problem, which is quite low-level, to users.
IIUC the general & correct answer is to implement the Unicode linebreaking algorithm, which respects the zero-width space. (I started a version of it over here, but have not made progress. If someone wanted a manageable, self-contained contribution to Quad, that would be a good one.)
In the interim you can add the zero-width space to the list of softies in the line wrapper and see if it does what you want. If so, you can make a PR.
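As a rough illustration of what "adding the zero-width space to the softies" would mean, here is a small greedy wrapper in Python. It is only a sketch and not quad's line wrapper: `SOFT_BREAKS` is a hypothetical stand-in for the softies list, and width is counted as one unit per character, which ignores combining marks and real font metrics.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Hypothetical stand-in for the wrapper's list of soft break characters.
SOFT_BREAKS = {" ", ZWSP}


def wrap(text, width):
    """Greedily wrap text, breaking only at characters in SOFT_BREAKS."""
    # Tokenise into (word, separator-that-followed-it) pairs.
    tokens, word = [], ""
    for ch in text:
        if ch in SOFT_BREAKS:
            tokens.append((word, ch))
            word = ""
        else:
            word += ch
    tokens.append((word, ""))

    lines, current, pending = [], "", ""
    for word, sep in tokens:
        if word:
            if not current:
                current = word
            elif len(current) + len(pending) + len(word) <= width:
                current += pending + word
            else:
                # Break here; the pending separator is dropped at the line end.
                lines.append(current)
                current = word
        # A space occupies width when kept mid-line; a ZWSP never does.
        pending = " " if sep == " " else ""
    if current:
        lines.append(current)
    return lines
```

For example, `wrap("aa\u200bbb\u200bcc", 4)` breaks between the second and third words even though the text contains no visible spaces, while a word longer than `width` is left unbroken on its own line.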