Making pdfs efficiently readable on the web

My experiments with pdf2htmlEX for this site.

Where we are

Scientific papers are distributed as pdfs, Adobe’s “portable document format”, intended to describe exactly how something looks in print. This might be a little silly, given that many articles are never actually printed, but it is the de facto standard. And the mere fact that we call articles “papers” might indicate that (the abstraction of) physical sheets of paper will remain with us for an indefinite period.

There are innovative attempts (e.g. pubpub) that have taken up the challenge to establish entirely new formats to disseminate knowledge. One goal is to make it easier to generate not-for-print formats.

The big publishers also attempt to generate html versions of papers, e.g., here (paywalled, sorry). This works reasonably well, but probably involves some manual tweaking and occasionally yields quite suboptimal results. In the above example, check out Algorithm 3 (ugly rasterized picture) or my manually adapted enumerate-list at the beginning of Section 1.1 (missing space between (G1) and text).

While my points of critique might be minor / overly picky, that paper is actually a simple case: it does not, for instance, use hand-TikZed new math symbols or inline pictures.

Who would use something as silly as that?

Well, there are well-known examples.

(In case you don’t recognize from the font which book this page is from, chances are you will not care about my nitpicks above, either.)

Moreover, I do not want to invest manual effort in generating html versions of my preprints.

What I wanted

Let’s face it: we are stuck with pdfs as the main exchange format, and automatic conversions that truly change the layout will always have glitches. Besides, I like the fixed layout and stable pages. So while I have been posting preprint pdfs on my website ever since I started publishing papers, I have been wondering if there is a better alternative.

What I want is a simple and robust way to render pdfs in the browser. There are different options.

Plugins: not always available, security flaws …
Javascript libraries that implement a full pdf viewer: reasonably efficient thanks to massive improvements in javascript engines, but it is clumsy to link into a pdf.
Using some webservice for pdf viewing (e.g. Scribd): apart from having to entrust your content to some company and rely on its service, it is unclear whether uploading preprints there would still fall under the author’s website exception (see, e.g., the policy for Springer LNCS).
(statically) convert pdf to html: until recently, I thought the result would be unacceptably far from the original pdf, but this has changed!

The solution: pdf2htmlEX

By using embedded fonts and HTML 5 typography features, pdf2htmlEX achieves remarkable results. It statically converts a pdf to html—it is not a viewer; the original pdf is not needed afterwards.

What works out of the box:

The html pages look almost identical to the original pdf,
they are searchable,
links are clickable (both within the pdf and to external pages), and
we can use plain html anchor links to link to any specific page within the pdf.
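
To illustrate the last point, here is a minimal sketch of building such a page link. It assumes that each page container in the generated html carries an id of the form `pf` followed by the page number in hexadecimal; this should be checked against the actual output before relying on it. The function name `page_link` and the file name in the usage line are made up for this example.

```shell
# Sketch: print a link to a given page of a converted document.
# ASSUMPTION: page containers have ids "pf" + page number in hexadecimal
# (e.g. "pfa" for page 10); verify this in the generated html.
page_link() {
    # $1: URL of the converted html file, $2: page number (decimal)
    printf '%s#pf%x\n' "$1" "$2"
}
```

Under that assumption, `page_link paper.html 10` prints `paper.html#pfa`.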

Note that pdf2htmlEX is no longer maintained, and there seems to be no simple solution to run it under a current Ubuntu (18.04); however, I got it running in a docker container.
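
A containerized call could look roughly like the sketch below. The post does not say which image was used, so the image name is deliberately left to a hypothetical variable `PDF2HTMLEX_IMAGE`; the wrapper name `pdf2htmlex_in_docker`, the `DOCKER` variable, and the mount point `/pdf` are likewise assumptions for this example. The sketch also assumes the image has `pdf2htmlEX` on its path.

```shell
# Sketch: run pdf2htmlEX from a container on files in the current directory.
# DOCKER and PDF2HTMLEX_IMAGE are hypothetical variables for this example.
DOCKER="${DOCKER:-docker}"

pdf2htmlex_in_docker() {
    # Abort with a message if no image has been chosen.
    local image="${PDF2HTMLEX_IMAGE:?set this to an image that provides pdf2htmlEX}"
    # Mount the current directory at /pdf and work from there, so that
    # relative input and output paths refer to files on the host.
    "$DOCKER" run --rm \
        -v "$PWD":/pdf -w /pdf \
        "$image" \
        pdf2htmlEX "$@"
}
```

All arguments are passed through unchanged, so the wrapper can be called with the same options as a locally installed pdf2htmlEX.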

Configuration

The default of pdf2htmlEX is to produce one big html file with all needed resources inlined. This is convenient, but does not work well for documents with more than 20 pages.

However, pdf2htmlEX can also split the resulting html into one file per pdf page; these are loaded dynamically via AJAX. That way, even large files like my dissertation open in an instant, and missing pages are rendered very quickly. (pdf.js might be a touch faster.)

The dynamic loading comes at a price though: searching the text (via the browser) no longer works across the whole document! It works fine for the currently rendered part, but of course not for the rest of the document.

pdf2htmlEX has further options on which other parts to embed; I chose to keep everything in extra files that would be shared by several papers. The final call uses the following options:

pdf2htmlEX --data-dir <my-data-dir> --embed-css 0 --embed-javascript 0 --embed-image 1 --fit-width 1000 --split-pages 1 --page-filename paper-%d.page paper.pdf
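
When several papers share one data dir, the call above can be wrapped in a small function. This is only a sketch: the function name `convert_paper` and the variables `PDF2HTMLEX` and `DATA_DIR` (including its default location) are introduced here for illustration, and the page file names are derived from the pdf's base name so that pages of different papers do not collide.

```shell
# Sketch: convert one pdf with the options shown above.
# PDF2HTMLEX and DATA_DIR are hypothetical variables for this example.
PDF2HTMLEX="${PDF2HTMLEX:-pdf2htmlEX}"
DATA_DIR="${DATA_DIR:-$HOME/pdf2htmlEX-data}"

convert_paper() {
    local pdf="$1"
    local name
    # "papers/foo.pdf" -> "foo", used as prefix for the per-page files
    name="$(basename "$pdf" .pdf)"
    "$PDF2HTMLEX" \
        --data-dir "$DATA_DIR" \
        --embed-css 0 --embed-javascript 0 --embed-image 1 \
        --fit-width 1000 \
        --split-pages 1 \
        --page-filename "${name}-%d.page" \
        "$pdf"
}
```

A loop such as `for f in *.pdf; do convert_paper "$f"; done` would then convert every pdf in the current directory with identical settings.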

My additions to the UI

By default, pdf2htmlEX generates a sidebar with the pdf outline. While generally a useful feature, it was annoying that there was no way to hide it when you would rather have the full window width for content.

I therefore hacked a few additional keyboard shortcuts into the javascript file; here is the overall list of available commands:

Keyboard Shortcuts in pdf2htmlEX

Shortcut	Function	Comment
`+`,`=`	zoom in
`-`	zoom out
`0`	reset view
`o`, `O`	toggle outline	only in my version
`f`, `F`	fit to width	only in my version
`p`, `P`	fit to page height	only in my version
`g`, `G`	go to page	only in my version

It really is a hack at the moment, but in case you would like to use it, you can do so as follows:

Copy pdf2htmlEX’s data dir to some convenient location; under Linux it would be found in /usr/share/pdf2htmlEX.
Replace the shipped pdf2htmlEX.min.js with my version.
Specify a custom data dir when calling pdf2htmlEX.
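
The three steps can be sketched as a short shell function. The function name `install_custom_data_dir` and all paths passed to it are placeholders for this example; the location of the shipped data dir depends on how pdf2htmlEX was installed.

```shell
# Sketch: create a custom data dir with a replaced javascript file.
# $1: shipped data dir, $2: new custom data dir, $3: patched javascript file
install_custom_data_dir() {
    local shipped="$1" custom="$2" patched_js="$3"
    mkdir -p "$custom"
    # Step 1: copy the contents of the shipped data dir (left untouched).
    cp -R "$shipped"/. "$custom"/
    # Step 2: overwrite the shipped script with the patched version.
    cp "$patched_js" "$custom/pdf2htmlEX.min.js"
}
```

The last step is then to pass the new directory via `--data-dir` when calling pdf2htmlEX.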

Remaining glitches

Searching within the full document does not work with dynamically rendered pages.
Searching does not work properly with ligatures. (pdf2htmlEX can remove them, but I would rather not.)
Small spacing issues with some characters (e.g. square brackets). Might be improved with manual hinting of the fonts.
UI is still rudimentary