PDF accessibility #15

Closed
opened 5 years ago by dapperjapper · 12 comments
dapperjapper commented 5 years ago (Migrated from github.com)

I know this project is in its early stages and I don't want to seem ungrateful for the open source contributions you've brought to the typography community, but I wanted to make sure this is on your radar for this project.

PDFs have a specification that allows them to annotate semantic structure ('tags') on top of whatever visual stuff they contain. This runs parallel to PDF bookmarks, which are helpful for jumping around the table of contents, but tags have even lower-level structure. (I don't know much about the technical stuff so correct me if I'm wrong.)

PDF tags make PDFs readable for screen readers, and are required within certain institutions that have rules around accessibility and equitable access. (For example, in the US, federal agencies must be "508 compliant.")

Word (and probably some other typesetting tools) generates tagged PDFs, but [LaTeX has never really been able to do this.](https://twitter.com/jasperclarkberg/status/1038865653581709312) I think it would be an amazing contribution to the state of accessibility in academia if there were a typesetting tool that *did* generate tagged PDFs.

I am interested in helping with this, but I'm not sure if this project is in a state that's ready for collaboration yet. Also, I have experience in high-level programming languages but I would probably have to do some reading on PDF compilers.

dapperjapper commented 5 years ago (Migrated from github.com)

Tagging wkhtmltopdf/wkhtmltopdf/issues/1605

mbutterick commented 5 years ago (Migrated from github.com)

I am open to looking at this. With the caveat that replacing LaTeX is not a design goal of Quad, so even in the best case, my net “contribution to the state of accessibility in academia” is likely zero.

mbutterick commented 5 years ago (Migrated from github.com)

Questions from someone who now knows slightly less than nothing about PDF tags from reading about them in the PDF Reference.

PDF tags seem to be a way for Adobe to shoehorn an affordance for XML- or HTML-style markup into the PDF format. It seems like an awkward fit, because the foundational idea of PDF is more of a drawing model (since PDF is derived from PostScript), not a structural model. BTW this is one of many reasons I [have misgivings about](https://practicaltypography.com/why-theres-no-e-book-or-pdf.html#the-problem-with-pdfs) the PDF format: it’s an excellent paper simulator, but continues to grow like the Winchester Mystery House into jobs it’s not suited for.

Moreover it seems like PDF tags depend on three ingredients:

  1. the insertion of tags in the source document (by an author manually, or by authoring software automatically).
  2. the parsing & interpretation of tags by PDF reader software.
  3. agreement between the authoring software and the reader software about the names of the tags and what they mean (that is, the same trouble that has existed between HTML pages and web browsers for 25 years)

If that much is accurate (I invite correction if not), then how has this worked in practice? For instance, what reader applications support PDF tags? You mention screen readers — are PDF tags core to what they do, or incidental? Are there certain tagging conventions that are observed? And then on the other side — what is the typical workflow for authors? Are PDF tags used for all documents, or mostly those that are being converted from HTML or XML? What tags are supported? (I don’t expect you to know all these answers, but if you have a link to other resources I could study, that would be helpful.)

What I’ve learned the hard way is that following a specification in the PDF Reference is pretty useless. In practice there are idiomatic expectations of the various programs that handle PDFs. So if you don’t implement a feature consistently with those expectations, it’s a waste.

mbutterick commented 5 years ago (Migrated from github.com)

I’m going to reframe this [as a project](https://github.com/mbutterick/quad/projects/1).

dapperjapper commented 5 years ago (Migrated from github.com)

Got it. I'm still meaning to gather more conclusive answers to your questions, but I apologize for the delay! Should I continue the discussion here if I have more information?

mbutterick commented 5 years ago (Migrated from github.com)

Sure, there’s no rush of course. You can reopen the issue when there’s actionable information available.

dapperjapper commented 5 years ago (Migrated from github.com)

I'm sure you already know some of this, but I'm collecting resources here on my deep dive into PDF formatting. Once I feel like I have a base-level understanding of the format, I'll do more research on what tools you are using to output PDFs (Pango?) and the facilities available for including tag structure.

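How to see the internal tagging structure in a PDF

  1. Open the print production tool in Adobe Acrobat Pro and click "Preflight".
    ![image](https://user-images.githubusercontent.com/246279/58584800-22c13900-8225-11e9-9771-cf67a3a96f50.png)

  2. Click Options -> Browse Internal PDF Structure...
    ![image](https://user-images.githubusercontent.com/246279/58584877-543a0480-8225-11e9-8d6e-4d85dfb0eb6f.png)

  3. Make sure you have selected the light bulb tab in the upper left corner, and browse StructTreeRoot. I have highlighted an "Artifact" tag in my [example tagged PDF](https://www.consumerfinance.gov/documents/7310/cfpb_consumer-credit-trends_first-time-homebuying-servicemember-mortgages_022019.pdf).
    ![image](https://user-images.githubusercontent.com/246279/58585961-0672cb80-8228-11e9-9b05-de2308b3ed05.png)

Here's an example of an Object under StructTreeRoot with alt text (for an image):
![image](https://user-images.githubusercontent.com/246279/58586621-82b9de80-8229-11e9-9057-49e071519edc.png)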
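
For readers without Acrobat Pro, here is a rough Python sketch of what such an object looks like in raw PDF syntax. It is an illustration only, not code from quad or any other library: the `StructElem` class, the object numbers, the alt text, and the drawing operators are all made up for the example, and the string encoder assumes ASCII-only alt text. What it tries to show is the link between the two halves of a tag: a structure element object (with `/S` for the tag name and `/Alt` for the alt text) and a marked-content sequence in the page's content stream that shares the same `MCID`.

```python
from dataclasses import dataclass
from typing import Optional


def pdf_string(text: str) -> str:
    """Encode ASCII text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


@dataclass
class StructElem:
    """Hypothetical, simplified model of one structure element under StructTreeRoot."""
    obj_num: int               # object number of this element (made up for the example)
    struct_type: str           # tag name, e.g. "Figure", "P", "H1"
    parent_num: int            # object number of the parent structure element
    page_num: int              # object number of the page holding the content
    mcid: int                  # marked-content ID that ties the tag to page content
    alt: Optional[str] = None  # alternate description read aloud by screen readers

    def to_pdf(self) -> str:
        entries = [
            "/Type /StructElem",
            f"/S /{self.struct_type}",
            f"/P {self.parent_num} 0 R",
            f"/Pg {self.page_num} 0 R",
            f"/K {self.mcid}",
        ]
        if self.alt is not None:
            entries.append(f"/Alt {pdf_string(self.alt)}")
        return f"{self.obj_num} 0 obj\n<< {' '.join(entries)} >>\nendobj"


def marked_content(struct_type: str, mcid: int, drawing_ops: str) -> str:
    """Wrap content-stream operators in a BDC ... EMC pair carrying the same MCID."""
    return f"/{struct_type} <</MCID {mcid}>> BDC\n{drawing_ops}\nEMC"


# Illustrative values only; none of these come from the example PDF above.
figure = StructElem(obj_num=12, struct_type="Figure", parent_num=10,
                    page_num=3, mcid=0, alt="Line chart (example alt text)")
print(figure.to_pdf())
print(marked_content("Figure", 0, "q 200 0 0 150 72 500 cm /Im1 Do Q"))
```

Content that is purely decorative (page numbers, rules) is wrapped as `/Artifact BMC ... EMC` in the content stream instead, so that it has no structure element and a screen reader can skip it.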

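Resources on tagging in PDFs

I'm still working through [this W3 document](https://www.w3.org/TR/WCAG20-TECHS/pdf). Check out the code example under PDF21 for what tagging looks like.

[Interesting background on PDF encoding](https://blog.idrsolutions.com/2013/01/understanding-the-pdf-file-format-overview/)

[Adobe's standards document (linked from the W3 doc)](http://www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf)
(check out section 14.8.4 "Standard Structure Types")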
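
As a rough illustration of how the standard structure types relate to the earlier question about agreeing on tag names: a PDF's structure tree root can carry a role map that maps custom tag names onto standard ones. The sketch below is an assumption-laden toy, not a validator. `STANDARD_TYPES` is only a partial list written from memory of section 14.8.4 and should be checked against the spec, the function name `resolve_tag` is invented here, and following chains of role-map entries is an assumption about how a reader might resolve them.

```python
from typing import Dict, Optional

# Partial list of standard structure types (see section 14.8.4 for the full set).
STANDARD_TYPES = frozenset({
    # grouping
    "Document", "Part", "Art", "Sect", "Div", "BlockQuote", "Caption",
    "TOC", "TOCI", "Index", "NonStruct", "Private",
    # block-level
    "P", "H", "H1", "H2", "H3", "H4", "H5", "H6",
    "L", "LI", "Lbl", "LBody",
    "Table", "TR", "TH", "TD", "THead", "TBody", "TFoot",
    # inline
    "Span", "Quote", "Note", "Reference", "BibEntry", "Code", "Link", "Annot",
    # illustration
    "Figure", "Formula", "Form",
})


def resolve_tag(tag: str, role_map: Dict[str, str]) -> Optional[str]:
    """Follow role-map entries until a standard type is reached.

    Returns None when the tag cannot be resolved (no entry, or a cycle).
    """
    seen = set()
    while tag not in STANDARD_TYPES:
        if tag in seen or tag not in role_map:
            return None
        seen.add(tag)
        tag = role_map[tag]
    return tag


# Hypothetical role map, like one an authoring tool might write out.
role_map = {"Heading1": "H1", "BodyText": "P", "Title": "Heading1"}
for name in ("Title", "BodyText", "Figure", "Sidebar"):
    print(name, "->", resolve_tag(name, role_map))
```

A tag that resolves to `None` here is one that reader software would have no shared meaning for, which is the interoperability risk raised earlier in the thread.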

dapperjapper commented 5 years ago (Migrated from github.com)

If you could provide more information on how quad hooks into a lower-level drawing interface, that would be super helpful! I've been looking at your code and I can't find any references to Pango or anything.

I also understand that quad is in its early stages and the connections it has to the underlying graphics system may change, so perhaps this project should wait a bit.

mbutterick commented 5 years ago (Migrated from github.com)

The lower-level parts aren’t documented yet because they’re still in flux. I don’t use Pango, however, nor any other library — I make the PDFs from scratch.

dapperjapper commented 5 years ago (Migrated from github.com)

Oh I guess I misread your reference to Pango in the documentation! I will keep an eye on the codebase as you develop it. Please let me know if you want anything from me 😃

mbutterick commented 5 years ago (Migrated from github.com)

It's unclear, sorry — where it says "These facilities are provided by Pango", "these" = the facilities provided by racket/draw = the ones I am avoiding.

mbutterick commented 2 years ago (Migrated from github.com)

Most of my PDF-making code was ported from the [`pdfkit`](https://github.com/foliojs/pdfkit) project a few years ago. In the past year, `pdfkit` has added [support for Tagged PDF](https://github.com/foliojs/pdfkit/commit/c177874e827823c81077af753660ec42cc26da07). So the most likely path to supporting Tagged PDF is to go back to `pdfkit` and port this new code into `quad` (or more specifically `pitfall`, which is the PDF-generating part of the library).
