Incorrect conversion of quotes in an Urdu string #266

Closed
opened 1 year ago by saadatm · 5 comments
saadatm commented 1 year ago (Migrated from github.com)
Owner

Found a case in which smart-quotes incorrectly converts straight quotes of an Urdu string. It happens when the closing quote is immediately followed by an Urdu full stop (or some other Urdu punctuation mark):

; Correct
> (define str-en "This is \"a sentence\".")
> (display (smart-quotes str-en))
This is “a sentence”.

; Incorrect
> (define str-ur "یہ ایک \"جملہ ہے\"۔")
> (display (smart-quotes str-ur #:double-open "”" #:double-close "“"))
یہ ایک ”جملہ ہے”۔

The result should have been یہ ایک ”جملہ ہے“۔. Note that str-ur ends with U+06D4 ARABIC FULL STOP, which is the character for ending sentences in Urdu.

Interestingly, if we end the Urdu string with the English full stop (i.e. U+002E FULL STOP), the result is correct:

> (define str-ur-2 "یہ ایک \"جملہ ہے\".")
> (display (smart-quotes str-ur-2 #:double-open "”" #:double-close "“"))
یہ ایک ”جملہ ہے“.

(Sidenote: The output appears a bit weird due to the directionality of the characters. Here it is after applying the right-to-left direction: یہ ایک ”جملہ ہے“.)

I was expecting that if we ended the English string with the Urdu full stop, then the output would be incorrect, but it actually turned out to be correct:

> (define str-en-2 "This is \"a sentence\"۔")
> (display (smart-quotes str-en-2))
This is “a sentence”۔

So I tested with the combinations of:

  1. an English string,
  2. an Urdu string,
  3. English full stop, comma, question mark, and semicolon, and
  4. Urdu full stop, comma, question mark, and semicolon

... (with [o] and [c] acting as opening and closing curly quotes respectively for simplification):

> (for* ([str '("This is \"a sentence\"" "یہ ایک \"جملہ ہے\"")]
         [punctuation '("." "," "?" ";" "۔" "،" "؟" "؛")])
      (display (smart-quotes (string-append str punctuation "\n")
                             #:double-open "[o]" #:double-close "[c]")))
This is [o]a sentence[c].
This is [o]a sentence[c],
This is [o]a sentence[c]?
This is [o]a sentence[c];
This is [o]a sentence[c]۔
This is [o]a sentence[c]،
This is [o]a sentence[c]؟
This is [o]a sentence[c]؛
یہ ایک [o]جملہ ہے[c].
یہ ایک [o]جملہ ہے[c],
یہ ایک [o]جملہ ہے[c]?
یہ ایک [o]جملہ ہے[c];
یہ ایک [o]جملہ ہے[o]۔
یہ ایک [o]جملہ ہے[o]،
یہ ایک [o]جملہ ہے[o]؟
یہ ایک [o]جملہ ہے[o]؛

As the results show, the output is incorrect only when an Urdu string is followed by Urdu punctuation. I am not sure why this happens, though.
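
One factor that may contribute is offered here as an assumption about the regular expressions involved, not as a confirmed diagnosis of the Pollen source. In Racket's `#px` syntax, `\w` matches only ASCII word characters, so a rule that checks for a word character next to the quote would treat an Urdu letter differently from an English one. A minimal check of that behavior (the helper names are hypothetical):

```racket
#lang racket/base

;; \w in Racket regexps is ASCII-only ([a-zA-Z0-9_]), so it does not
;; match Urdu letters; \p{L} (any Unicode letter) does.
(define (ascii-word-char? s) (regexp-match? #px"^\\w$" s))
(define (unicode-letter? s) (regexp-match? #px"^\\p{L}$" s))

(ascii-word-char? "e")  ; last letter of "sentence" => #t
(ascii-word-char? "ے")  ; last letter of "ہے"       => #f
(unicode-letter? "ے")   ; => #t
```

If the quote rules do rely on `\w`, this would be consistent with the English string behaving correctly even when followed by Urdu punctuation.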

mbutterick commented 1 year ago (Migrated from github.com)
Owner

smart-quotes isn’t part of the supported public interface for Pollen because it makes no attempt to behave well beyond the easy cases. (That’s why it’s in the unstable directory.)

I suggest taking [the existing code](https://github.com/mbutterick/pollen/blob/816ce0f7af739b09dc0d64852d905ece24662bde/pollen/unstable/typography.rkt#L59) as a starting point and making a function that works better for your project.
saadatm commented 1 year ago (Migrated from github.com)
Owner

Thanks. I fiddled with the code, and adding the Urdu punctuation marks to [sentence-ender-exceptions](https://github.com/mbutterick/pollen/blob/816ce0f7af739b09dc0d64852d905ece24662bde/pollen/unstable/typography.rkt#L53) fixed the issue.

I'll be using the modified version in my project, but how would you feel about adding a new keyword argument to smart-quotes for passing additional punctuation marks? It could default to an empty string, and whatever the user passes would be appended to the regex used in sentence-ender-exceptions. I'll be glad to open a PR if you think it's a good idea. As far as I can tell, sentence-ender-exceptions is not used anywhere other than smart-quotes.
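
The proposed keyword argument could look roughly like the sketch below. This is not Pollen's code: `curl-double-quotes`, `default-enders`, `enders->alternation`, and `#:extra-enders` are hypothetical names, the rules are deliberately simplified to double quotes only, and the default list is a stand-in for whatever sentence-ender-exceptions actually contains.

```racket
#lang racket/base
(require racket/string)

;; Hypothetical stand-in for the characters allowed after a closing quote;
;; not copied from Pollen.
(define default-enders ",.:;?!])}")

;; Build an alternation such as ",|\\.|:|..." with every character escaped.
(define (enders->alternation enders)
  (string-join (for/list ([c (in-string enders)])
                 (regexp-quote (string c)))
               "|"))

;; Hypothetical sketch of the proposed interface: #:extra-enders defaults
;; to "" and is appended to the built-in list.
(define (curl-double-quotes str
                            #:double-open [open-str "“"]
                            #:double-close [close-str "”"]
                            #:extra-enders [extra ""])
  (define alternation
    (enders->alternation (string-append default-enders extra)))
  ;; Closing quote: after non-whitespace, before an ender, whitespace, or end.
  (define close-rx (pregexp (format "(?<=\\S)\"(?=~a|\\s|$)" alternation)))
  ;; Any quote still left and followed by non-whitespace is treated as opening.
  (define open-rx #px"\"(?=\\S)")
  (regexp-replace* open-rx
                   (regexp-replace* close-rx str (regexp-replace-quote close-str))
                   (regexp-replace-quote open-str)))

(define str-ur "یہ ایک \"جملہ ہے\"۔")

;; Without the extra marks, the closing quote is mistaken for an opening one.
(curl-double-quotes str-ur #:double-open "[o]" #:double-close "[c]")

;; With the Urdu marks passed in, the closing quote is recognized.
(curl-double-quotes str-ur #:double-open "[o]" #:double-close "[c]"
                    #:extra-enders "۔،؟؛")
```

Defaulting `#:extra-enders` to the empty string keeps existing calls unchanged, which is the property the proposal depends on.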
mbutterick commented 1 year ago (Migrated from github.com)
Owner
  1. For your purposes, would it fix smart-quotes to add the Urdu full stop to the current value of sentence-ender-exceptions?

  2. Is there some Unicode character class that covers what sentence-ender-exceptions is attempting to be?

I’m not averse to the PR you propose, but I think sentence-ender-exceptions is the wrong way to do things, and building a public interface around it doesn’t make it less wrong.

saadatm commented 1 year ago (Migrated from github.com)
Owner
  1. Yes, but not just the Urdu full stop: the Urdu comma (،), Urdu question mark (؟), and Urdu semicolon (؛) too.
  2. I am not sure. Maybe [Punctuation, Other (Po)](https://www.fileformat.info/info/unicode/category/Po/list.htm) (minus the straight quotes); [Punctuation, Open (Ps)](https://www.fileformat.info/info/unicode/category/Ps/list.htm); and [Punctuation, Close (Pe)](https://www.fileformat.info/info/unicode/category/Pe/index.htm)?
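
The category-based idea can be tried out in Racket with `char-general-category` or with the `\p{...}` regexp syntax. The sketch below is only an illustration of the suggestion above; `ender-candidate?` and `ender-rx` are hypothetical names, and whether Po, Ps, and Pe are the right set is exactly the open question.

```racket
#lang racket/base

;; Hypothetical predicate: characters in Po, Ps, or Pe,
;; minus the two straight quotes (which are themselves Po).
(define (ender-candidate? c)
  (and (memq (char-general-category c) '(po ps pe))
       (not (memv c '(#\" #\')))
       #t))

;; The same idea as a regexp fragment, using Racket's \p{...} syntax.
(define ender-rx #px"(?!['\"])(?:\\p{Po}|\\p{Ps}|\\p{Pe})")

;; Urdu and English punctuation, a closing paren, a straight quote, a letter.
(define sample "۔،؟؛.,?;)\"a")

;; Keep only the characters the predicate accepts.
(define kept-by-predicate
  (list->string (for/list ([c (in-string sample)]
                           #:when (ender-candidate? c))
                  c)))

;; Keep only the characters the regexp accepts.
(define kept-by-regexp
  (apply string-append (regexp-match* ender-rx sample)))

(equal? kept-by-predicate kept-by-regexp)
```

Both versions keep the four Urdu marks, the four English marks, and the closing parenthesis, and drop the straight quote and the letter.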
saadatm commented 1 year ago (Migrated from github.com)
Owner

I have added a custom smart quotes function (based on the original) in my project. Thanks for the discussion. :-)

Reference: mbutterick/pollen#266