default view in browser uses Western instead of Unicode encoding #44

Closed
opened 9 years ago by gour · 17 comments
gour commented 9 years ago (Migrated from github.com)

Hello,

I'm taking my first steps with Pollen, testing with text that contains Croatian characters (č, ć, đ, š, ž). My source file is UTF-8:

```
$ file haribol.txt.pp
haribol.txt.pp: UTF-8 Unicode text
```

Still, when I render it in the browser (Firefox), it is displayed using Western encoding, and the text is not correct until I select Unicode encoding.

Strangely enough, rendering in the terminal generates a correct file.

What's the matter?
mbutterick commented 9 years ago (Migrated from github.com)

Under the HTTP 1.1 specification, browsers use ISO-8859-1 (Western) text encoding unless told otherwise (e.g. by an HTTP header set by the server, or an encoding header in the document).
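
To illustrate the effect, here is a minimal Racket sketch (not Pollen code; `misread-as-western` and `sample` are names made up for this example). It approximates the browser's Western decoding with Latin-1; a real browser may use the closely related windows-1252, which maps a few bytes differently.

```racket
#lang racket/base

;; Encode a string as UTF-8, then decode those bytes one byte per
;; character, as a decoder assuming Latin-1 would.
(define (misread-as-western s)
  (bytes->string/latin-1 (string->bytes/utf-8 s)))

(define sample "čćđšž")
(define utf8-bytes (string->bytes/utf-8 sample))

;; Each of these letters takes two bytes in UTF-8, so the misread
;; text has ten characters instead of five.
(string-length sample)                            ; 5
(bytes-length utf8-bytes)                         ; 10
(string-length (misread-as-western sample))       ; 10

;; Decoding with the correct encoding round-trips the text.
(equal? (bytes->string/utf-8 utf8-bytes) sample)  ; #t
```

The bytes on disk are the same in both cases; only the decoder's assumption changes, which is why the file looks fine in a UTF-8 terminal and garbled in the browser.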

mbutterick commented 9 years ago (Migrated from github.com)

PS. The `file` command in the terminal doesn’t know the encoding either, but it uses a more elaborate set of tests to make an educated guess. As a rule, web browsers never do this: they only decode content according to explicit instructions.

gour commented 9 years ago (Migrated from github.com)

> Under the HTTP 1.1 specification, browsers use ISO-8859-1 (Western) text encoding unless told otherwise (e.g. by an HTTP header set by the server, or an encoding header in the document).

What about the browser's setting to use Unicode encoding?

mbutterick commented 9 years ago (Migrated from github.com)

That’s really more of a question about browser UI. I have no special expertise. As I understand Firefox, you can set a fallback encoding in the preferences that’s used when a file doesn’t declare an encoding. As for the “Character Encoding” option in the main menu, it’s unclear to me whether this merely displays the current page encoding, or allows you to override the encoding of all pages during a browser session (in my fiddling around with it, I haven’t detected a consistent pattern).

But the general point remains: a web browser expects a file to declare its own encoding.

gour commented 9 years ago (Migrated from github.com)

OK. Thank you for your input.

jbaum98 commented 6 years ago (Migrated from github.com)

I think you can declare the character encoding from the HTTP headers, even if your file is plaintext and doesn't have a charset HTML attribute: https://www.w3.org/International/articles/definitions-characters/#httpheader

I guess that would have to be set in the server. Is there a way for us to modify the headers sent by the Pollen server?

mbutterick commented 6 years ago (Migrated from github.com)

No, because the files aren’t going to be served dynamically from the Pollen server.

jbaum98 commented 6 years ago (Migrated from github.com)

Who's doing the serving then? Whether or not the files are generated dynamically shouldn't affect the HTTP headers on the response from the server.

mbutterick commented 6 years ago (Migrated from github.com)

The Pollen project server is just a convenience for previewing files during development.

The idea is that when you’re done, you move your rendered files over to your production server (for instance, I use Apache).

IOW, though it would be possible to modify the Pollen project server to do what you suggest, it doesn’t solve the problem in a portable way.

Thus, the best practice is for each file to declare its own encoding.
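
For HTML output, a file can declare its own encoding with a `meta` element in its head. A rough Racket sketch, assuming a hypothetical helper named `html-page` (it is not part of Pollen and does no HTML escaping):

```racket
#lang racket/base

;; Hypothetical helper: wrap a title and body text in a minimal HTML
;; page whose head declares UTF-8, so the declaration travels with the
;; file regardless of which server delivers it.
(define (html-page title body)
  (string-append
   "<!DOCTYPE html>\n"
   "<html>\n<head>\n"
   "<meta charset=\"utf-8\">\n"
   "<title>" title "</title>\n"
   "</head>\n<body>\n"
   body
   "\n</body>\n</html>\n"))

(display (html-page "Haribol" "<p>čćđšž</p>"))
```

Because the declaration is part of the document, it still applies after the rendered file is moved to a production server.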

jbaum98 commented 6 years ago (Migrated from github.com)

Text files can't declare their own encodings though, and it seems like at least Firefox has trouble detecting the correct encoding automatically. Therefore for text files you have to rely on the HTTP header.
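
As a rough illustration of what relying on the HTTP header means, here is a Racket sketch of a function that computes a `Content-Type` value carrying a `charset` parameter. The names `content-type-for` and `content-type-header` are hypothetical, and this is not the code from the pull request or from Pollen's server.

```racket
#lang racket/base

;; Hypothetical helper: choose a Content-Type value from a file name,
;; adding an explicit charset for text formats.
(define (content-type-for filename)
  (cond
    [(regexp-match? #rx"[.]txt$" filename) "text/plain; charset=utf-8"]
    [(regexp-match? #rx"[.]html?$" filename) "text/html; charset=utf-8"]
    [else "application/octet-stream"]))

;; Format the full header line a server would send with the response.
(define (content-type-header filename)
  (string-append "Content-Type: " (content-type-for filename)))

(content-type-header "haribol.txt")
;; "Content-Type: text/plain; charset=utf-8"
```

With a header like this, the browser is told the encoding explicitly and has no need to fall back to a default.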

Also, considering that good Unicode support seems to be a goal of Pollen, and that the Pollen project server is meant as a convenience for previewing files, it would be a good feature to automatically set the charset in the MIME type to `utf-8`.

I've created a pull request with the necessary changes in #165. Most of the work involves switching from the wrapper `serve/servlet` to the underlying `dispatcher-sequence` and `serve/launch/wait` so that we can manually specify a function for computing the MIME type. That's the only semantic change I meant to make, but it's a big diff because it's partly a copy-paste of the implementation of `serve/servlet`, albeit simplified.

Let me know if you disagree and don't think this is a worthwhile feature.

jbaum98 commented 6 years ago (Migrated from github.com)

After a little more research (of course, after I had already done the work and submitted the pull request!), I realized I was wrong that text files can't declare their own encodings. You can use a [byte order mark](https://en.wikipedia.org/wiki/Byte_order_mark), which means inserting a special character at the beginning of the file. In UTF-8, the BOM character (U+FEFF) is encoded as the bytes `0xEF 0xBB 0xBF`. I've tested this in Firefox myself, and Pollen passes it through with no problem. I'll close my pull request as well.
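
A small Racket sketch to check those bytes and to prepend the mark only when it is missing. `bom` and `ensure-bom` are names made up for this example, not Pollen functions.

```racket
#lang racket/base

;; The byte order mark is the single character U+FEFF.
(define bom "\uFEFF")

;; Its UTF-8 encoding is the three bytes EF BB BF (239 187 191).
(bytes->list (string->bytes/utf-8 bom))  ; '(239 187 191)

;; Hypothetical helper: prepend the BOM unless the text already
;; starts with one, so applying it twice does not stack marks.
(define (ensure-bom text)
  (if (and (positive? (string-length text))
           (char=? (string-ref text 0) #\uFEFF))
      text
      (string-append bom text)))

(define marked (ensure-bom "čćđšž"))
(equal? (ensure-bom marked) marked)      ; #t
```

The check for an existing mark matters if the same text can pass through the helper more than once.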

mbutterick commented 6 years ago (Migrated from github.com)

> a goal of Pollen to have good support for Unicode

True, though a higher goal is minimizing magic behavior.

jbaum98 commented 6 years ago (Migrated from github.com)

A nice way to add the byte order mark to text files automatically, if you want it, is to use the following `template.txt.p`:

```racket
◊(local-require racket/list)
◊(apply string-append "\uFEFF" (filter string? (flatten doc)))
```
mbutterick commented 6 years ago (Migrated from github.com)

True, but still magical. I’m ready to formulate some kind of Murphy’s Law of default software behavior — as soon as you impose a change like this, the next bug filed will be “Pollen is erroneously putting a BOM at the front of my text output”.

jbaum98 commented 6 years ago (Migrated from github.com)

I agree for sure! I was just putting it here to share the solution in case somebody else finds this looking for a solution to the same problem. Would definitely be weird to automatically stick in the BOM without telling anyone.

mbutterick commented 6 years ago (Migrated from github.com)

Maybe you could find an existing place (or propose a new place) in the Pollen documentation for this information? I’m fully in favor of preserving discoveries. But anything in a GH issue is unlikely to be found again.

mbutterick commented 6 years ago (Migrated from github.com)

Or, depending on how strongly you feel about text handling, it could become a `->text` function in a new `pollen/template/text` module that would parallel `->html`. One of the options could be an encoding option that would handle prepending the BOM.

Reference: mbutterick/pollen#44