Incorrect conversion of quotes in an Urdu string
Closed · opened 1 year ago by saadatm · 5 comments
Found a case in which smart-quotes incorrectly converts the straight quotes in an Urdu string. It happens when the closing quote is immediately followed by an Urdu full stop (or some other Urdu punctuation mark):
The result should have been یہ ایک ”جملہ ہے“۔. Note that ۔ is U+06D4 ARABIC FULL STOP, which is the character for ending sentences in Urdu.
Interestingly, if we end the Urdu string with the English full stop (i.e. U+002E FULL STOP), the result is correct:
(Sidenote: The output appears a bit weird due to the directionality of the characters. Here it is after applying the right-to-left direction:
یہ ایک ”جملہ ہے“.)
I was expecting that if we ended the English string with the Urdu full stop, then the output would be incorrect, but it actually turned out to be correct:
So I tested the combinations of Urdu and English strings with Urdu and English full stops (with … and [c] acting as opening and closing curly quotes respectively, for simplification):
As the results show, the output is incorrect only when an Urdu string ends with Urdu punctuation. Not sure why this is happening, though.
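One possible explanation for this pattern, sketched in Python rather than in Pollen's own code: if the converter applies ordered regex rules in which `\w` only matches ASCII letters, then an Urdu letter before a quote looks like a non-word character, and the quote is taken as an opening one unless the next character is on a known list of sentence enders. Everything below (the rule set, the `ENDERS` list, the `convert` name, and the `[o]`/`[c]` placeholder markers) is an assumption made for illustration, not the library's actual implementation.

```python
import re

OPEN, CLOSE = "[o]", "[c]"   # placeholder markers standing in for curly quotes
URDU_STOP = "\u06d4"         # U+06D4 ARABIC FULL STOP

# Assumed ASCII-only list of marks that may directly follow a closing quote.
ENDERS = re.escape(",.:;?!])}")

# Hypothetical ordered rules. re.ASCII restricts \w to [A-Za-z0-9_],
# so Urdu letters count as non-word characters here.
RULES = [
    # not after a word character, but before a known ender: closing quote
    (re.compile(r'(?<!\w)"(?=[' + ENDERS + r'])', re.ASCII), CLOSE),
    # not after a word character, before any non-space: opening quote
    (re.compile(r'(?<!\w)"(?=\S)', re.ASCII), OPEN),
    # after a non-space, not before a word character: closing quote
    (re.compile(r'(?<=\S)"(?!\w)', re.ASCII), CLOSE),
]

def convert(text):
    """Apply each rule in order to the whole string."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

samples = [
    'یہ ایک "جملہ ہے"' + URDU_STOP,   # Urdu string, Urdu full stop
    'یہ ایک "جملہ ہے".',               # Urdu string, English full stop
    'This is a "sentence"' + URDU_STOP,  # English string, Urdu full stop
    'This is a "sentence".',             # English string, English full stop
]
for sample in samples:
    print(convert(sample))
```

Under these assumptions, only the first sample ends up with two opening markers. The other three get one opening and one closing marker, which matches the observation above.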
smart-quotes isn’t part of the supported public interface for Pollen because it makes no attempt to behave well beyond the easy cases. (That’s why it’s in the …)
I suggest taking the existing code as a starting point and making a function that works better for your project.
Thanks. I fiddled with the code, and adding the Urdu punctuation marks to sentence-ender-exceptions fixed the issue.
I'll be using the modified version in my project, but how does adding a new keyword argument to smart-quotes for passing additional punctuation marks sound to you? It can be an empty string by default, and whatever the user passes in it can be appended to the regex used in sentence-ender-exceptions. I'll be glad to open a PR if you think it's a good idea. As far as I can tell, sentence-ender-exceptions is not used anywhere other than …
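To make the proposal concrete, here is a rough Python sketch of what such an argument could look like. It is not Pollen code: the function name `smart_quotes_sketch`, the `extra_enders` parameter, the base list in `BASE_ENDERS`, and the `[o]`/`[c]` placeholder markers are all hypothetical, and the rules assume an ASCII-only `\w`.

```python
import re

BASE_ENDERS = ",.:;?!])}"   # assumed default exception list

def smart_quotes_sketch(text, extra_enders="", open_q="[o]", close_q="[c]"):
    """Convert straight double quotes; extra_enders extends the exception list."""
    # With the default empty string, the behaviour is unchanged.
    enders = re.escape(BASE_ENDERS + extra_enders)
    rules = [
        # quote not after a word character but before a listed ender: closing
        (re.compile(r'(?<!\w)"(?=[' + enders + r'])', re.ASCII), close_q),
        # quote not after a word character, before any non-space: opening
        (re.compile(r'(?<!\w)"(?=\S)', re.ASCII), open_q),
        # quote after a non-space, not before a word character: closing
        (re.compile(r'(?<=\S)"(?!\w)', re.ASCII), close_q),
    ]
    for pattern, replacement in rules:
        # a callable replacement avoids backslash processing in custom quote strings
        text = pattern.sub(lambda m, r=replacement: r, text)
    return text

urdu = 'یہ ایک "جملہ ہے"' + "\u06d4"   # ends with U+06D4 ARABIC FULL STOP

print(smart_quotes_sketch(urdu))
# extra marks: Urdu full stop, comma, question mark, semicolon
print(smart_quotes_sketch(urdu, extra_enders="\u06d4\u060c\u061f\u061b"))
```

In this sketch the first call still yields two opening markers, while the second call closes the quote before the Urdu full stop. Callers who pass nothing see no change.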
For your purposes, would it fix smart-quotes to add the Urdu full stop to the current value of …
Is there some Unicode character class that covers what sentence-ender-exceptions is attempting to be?
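A quick way to explore that question is to look up the Unicode general category of each mark. The Python sketch below does this for an assumed ASCII exception list plus four Urdu marks. The lists and the helper names `category_table` and `in_punct_categories` are illustrative assumptions, not anything taken from Pollen.

```python
import unicodedata

ASCII_MARKS = ",.:;?!])}"                  # assumed current exception list
URDU_MARKS = "\u06d4\u060c\u061f\u061b"    # full stop, comma, question mark, semicolon

def category_table(chars):
    """Map each character to its Unicode general category."""
    return {ch: unicodedata.category(ch) for ch in chars}

def in_punct_categories(ch, categories=("Po", "Pe")):
    """True if the character is 'other' or 'close' punctuation."""
    return unicodedata.category(ch) in categories

for ch, cat in category_table(ASCII_MARKS + URDU_MARKS).items():
    print(f"U+{ord(ch):04X} {cat}")

# The straight double quote is itself category Po, so a test based only on
# general categories would also accept the very character being converted.
print(in_punct_categories('"'))
```

The marks fall into Po (other punctuation) and Pe (close punctuation), but Po is much broader than the exception list, so a category test alone would over-match unless some characters were excluded by hand.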
I’m not averse to the PR you propose, but I think sentence-ender-exceptions is the wrong way to do things, and building a public interface around it doesn’t make it less wrong.
… ،), Urdu question mark (؟), and Urdu semicolon ( …
I have added a custom smart quotes function (based on the original) in my project. Thanks for the discussion. :-)