
Making parallel text PDFs for e-readers

This post explains two scripts, dualtext2pdf and dualtext2html, that convert two Markdown-formatted texts with alignment cues into a dual-text (parallel text) PDF. The scripts are intended for generating bilingual, e-reader-friendly PDFs for practicing a foreign language.


Example of a Bulgarian and Danish bilingual text generated by dualtext2pdf, shown on an e-reader. The text is from Hans Christian Andersen's The Tinderbox.

Motivation

When learning a new language, it can be very useful to study a bilingual text where the same content is presented in one's native language next to the foreign language. I find this kind of text useful in many ways. The texts can be read casually, picking up words here and there, or studied intensely like a word puzzle, where the meaning of new words is pieced together by deduction and dictionary look-ups.

As a native Danish speaker attempting to learn Bulgarian, I quickly realized that finding a ready-made bilingual text with this combination was extremely unlikely. Both are relatively small languages spoken by fewer than 10 million people. Even finding English/Bulgarian bilingual texts was not easy, and the ones I found weren't free and didn't have the format I was looking for. I wanted a format suitable for display on an e-reader.

Do it yourself?

I realized that the Danish author Hans Christian Andersen is a perfect source of material for this endeavor. His works are in the public domain in Danish, and out-of-copyright translations exist in many other languages. But how to piece the texts together? My first attempt was to manually align the two languages page by page in separate documents, print them to PDF files, and then use various tricks to merge them into a single PDF with the desired two-column format. The whole process was a pain, and I never really got the merging part to work properly.

The solution

It was time to scratch my own itch and make some better tools, so I wrote two scripts: dualtext2html and dualtext2pdf.

The scripts essentially convert two Markdown-formatted texts, plus styling and configuration settings, into PDF and/or HTML output. The texts are aligned by inserting horizontal rules at equivalent places in the Markdown of both languages. Note that I use *** for the horizontal rules in the following; Markdown allows several other ways of writing them.

For example, with the first text in Bulgarian:

Огнивото
--------

*Ханс Кристиан Андерсен, 1835*

***
Из широкия път вървеше войник: едно, две, едно, две! На гърба си носеше раница, а на кръста — сабя, защото беше ходил на война и сега се връщаше у дома си. По пътя го срещна една стара магьосница. Тя беше ужасно грозна. Долната й устна висеше до гърдите.
***
— Добър вечер, войниче! — каза магьосницата. — Каква хубава сабя и каква голяма раница имаш! Ти си истински войник! Затова ще получиш пари, колкото искаш.
***
— Благодаря ти, стара магьоснице! — отвърна войникът.

— Виждаш ли това голямо дърво? — рече магьосницата и посочи едно дърво, което растеше наблизо. — То е съвсем кухо отвътре. Ако се изкачиш на върха му, ще видиш една дупка, по която можеш да се спуснеш надолу до самите корени. Аз ще ти вържа едно въже за кръста, за да мога да те извадя, щом ми извикаш.

— Какво ще правя в дървото? — попита войникът.

and the second in Danish:

Fyrtøjet
--------

*Hans Christian Andersen, 1835*

***
Der kom en soldat marcherende hen ad landevejen: én, to! én, to! Han havde sit tornyster på ryggen og en sabel ved siden, for han havde været i krigen, og nu skulle han hjem. Så mødte han en gammel heks på landevejen; hun var så ækel, hendes underlæbe hang hende lige ned på brystet. Hun sagde: 
***
"God aften, soldat! Hvor du har en pæn sabel og et stort tornyster, du er en rigtig soldat! Nu skal du få så mange penge, du vil eje!"
***
"Tak skal du have, din gamle heks!" sagde soldaten.

"Kan du se det store træ?" sagde heksen, og pegede på et træ, der stod ved siden af dem. "Det er ganske hult inden i! Der skal du krybe op i toppen, så ser du et hul, som du kan lade dig glide igennem og komme dybt i træet! Jeg skal binde dig en strikke om livet, for at jeg kan hejse dig op igen, når du råber på mig!"

"Hvad skal jeg så nede i træet?" spurgte soldaten.

with the default stylesheet and settings, a PDF comes out that looks like this on an e-reader:

The Tinderbox bg-dk page1

The Tinderbox bg-dk page2

Notice how the alignment cues aren't actually visible, but they enforce the alignment of a paragraph on the second page.
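
Since the alignment depends on both texts having their cues in matching places, a quick sanity check before generating the PDF can save some head-scratching. The following is a minimal sketch, not code taken from the scripts: the names split_on_cues and check_alignment are made up for illustration, and it assumes that a cue is a line consisting only of three or more asterisks, as in the examples above. Lines of dashes are deliberately not treated as cues, because a dashed line under a title is a Markdown heading underline.

```python
import re

# Assumption: a cue is a line made only of three or more asterisks,
# the form used in this post. Other Markdown spellings are ignored here.
CUE = re.compile(r"^\s*\*{3,}\s*$")


def split_on_cues(text):
    """Return the list of text segments separated by cue lines."""
    segments, current = [], []
    for line in text.splitlines():
        if CUE.match(line):
            segments.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    segments.append("\n".join(current).strip())
    return segments


def check_alignment(text_a, text_b):
    """Pair up the segments; raise ValueError if the counts differ."""
    a, b = split_on_cues(text_a), split_on_cues(text_b)
    if len(a) != len(b):
        raise ValueError(f"text A has {len(a)} segments, text B has {len(b)}")
    return list(zip(a, b))
```

Calling check_alignment on the contents of the two Markdown files returns the paired segments, or fails early when one text has a cue that the other lacks.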

Documentation

dualtext2html

dualtext2html performs the first step in the process. It takes two Markdown files and an HTML template, and generates an HTML file with the texts next to each other in an HTML table. The horizontal rules are removed from the Markdown and used to start new rows in the table, so that the texts line up.
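
To illustrate the idea, here is a rough sketch of how such a table could be built. It is not the actual dualtext2html code: the function names are hypothetical, the cue is assumed to be a line of three or more asterisks, and the default renderer is a simple stand-in that only escapes the text and wraps paragraphs in p tags instead of doing a real Markdown conversion.

```python
import html
import re
from itertools import zip_longest

# Assumed cue form: a line containing only three or more asterisks.
CUE = re.compile(r"^[ \t]*\*{3,}[ \t]*$", re.MULTILINE)


def plain_render(segment):
    """Stand-in renderer: escape the text and wrap each paragraph in <p>."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", segment) if p.strip()]
    return "".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)


def dual_table(text_a, text_b, render=plain_render):
    """Build an HTML table with one row per pair of aligned segments."""
    rows = []
    # zip_longest keeps trailing segments if one text has more cues.
    for left, right in zip_longest(CUE.split(text_a), CUE.split(text_b), fillvalue=""):
        rows.append(f"<tr><td>{render(left)}</td><td>{render(right)}</td></tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

With the markdown package installed, passing render=markdown.markdown would convert each segment from Markdown to HTML instead of using the stand-in. The resulting table would then be placed into the HTML template.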

dualtext2pdf

The dualtext2pdf script is all about automation of the process and generating a nice PDF output. It runs dualtext2html, wkhtmltopdf and pdftk with templates and configuration settings provided in a config file.

Inputs:

  • Config file
  • Two Markdown files with the source texts
  • HTML template with styling
  • HTML file for first PDF page intended as the book cover (optional)

Outputs:

  • The parallel text PDF
  • A similar parallel text HTML (this doesn't include the cover page)

Usage example:

Install the requirements: Python, wkhtmltopdf, pdftk and the Python package markdown. On a Debian/Ubuntu-based system, this might be done as follows:

$ sudo apt install python python-pip wkhtmltopdf pdftk
$ pip install markdown

Download the scripts:

Make them executable and place them somewhere in the path, e.g. in /usr/bin/, or create symlinks to them there.

Now cd into an empty directory and run the demo function to generate a set of demonstration files:

$ dualtext2pdf demo "New Text"

This generates a number of output files:

dualtext2pdf, version: 2018-02-04
Writing: "New Text.cfg"
Writing: "a6_template.html"
Writing: "New Text_a.md"
Writing: "New Text_b.md"
Writing: "New Text_cover.html"
The cover page html links to a file named: "New Text_cover.jpg" not generated here

The generated config file can be used to compile all that into a PDF:

$ dualtext2pdf "New Text.cfg"

And a PDF like this pops out of it. Note that the cover page didn't work out because there was no cover picture.

The config file

Let's take a look at New Text.cfg, generated with the demo function:

[common]
# The common settings are used unless overridden in an output_x section
page_settings = -L 2 -B 1 -R 2 -T 1 -s A6 -O Landscape
cover_settings = -L 0 -B 0 -R 0 -T 0 -s A6

[output_1]
text1    = New Text_a.md
text2    = New Text_b.md
template = a6_template.html

cover    = New Text_cover.html
make_pdf = New Text.pdf
#make_html = New Text.html

The page_settings and cover_settings options are passed to wkhtmltopdf when generating PDF files. They define the margins and the page size, here A6 landscape, which I find suits an e-reader well.

Multiple outputs can be generated with a single config file by adding further sections like [output_2] and [output_3]. Any option can be placed in either the common section or an output section; when an option appears in both, the value in the output section takes priority.
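
The override rule can be pictured with Python's standard configparser module. This is only a sketch of the behaviour described above, not the script's own code: resolve_outputs is a made-up name, and the [output_2] section with its file names and A5 page size is invented for the example.

```python
import configparser

# Sample config; [output_2] and its values are hypothetical.
SAMPLE = """\
[common]
page_settings = -L 2 -B 1 -R 2 -T 1 -s A6 -O Landscape

[output_1]
text1 = New Text_a.md
make_pdf = New Text.pdf

[output_2]
text1 = Other_a.md
page_settings = -s A5
make_pdf = Other.pdf
"""


def resolve_outputs(config_text):
    """Return {section: options}, with output options overriding common ones."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    common = dict(parser["common"]) if parser.has_section("common") else {}
    outputs = {}
    for name in parser.sections():
        if name.startswith("output_"):
            # Later dict entries win, so the output section takes priority.
            outputs[name] = {**common, **dict(parser[name])}
    return outputs
```

Here resolve_outputs(SAMPLE) gives output_1 the page settings from the common section, while output_2 keeps its own page_settings value.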

If the cover option is defined, the script invokes wkhtmltopdf twice: once for the cover page and once for the main content. The generated PDFs are then concatenated by pdftk.
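
As a rough picture of that sequence, the sketch below only builds the command lines and does not run anything. It is an assumption about how the steps could be wired together, not the script's actual implementation: build_commands and the intermediate file names are hypothetical, while the option names come from the config file shown above.

```python
import shlex


def build_commands(opts, html_path, pdf_path):
    """Return the argument lists for the external tools, in run order."""
    commands = []
    body_pdf = pdf_path
    if "cover" in opts:
        # Hypothetical intermediate file names, chosen for this sketch.
        cover_pdf, body_pdf = "cover.tmp.pdf", "body.tmp.pdf"
        commands.append(["wkhtmltopdf",
                         *shlex.split(opts.get("cover_settings", "")),
                         opts["cover"], cover_pdf])
    # Main content: the dual text HTML rendered with the page settings.
    commands.append(["wkhtmltopdf",
                     *shlex.split(opts.get("page_settings", "")),
                     html_path, body_pdf])
    if "cover" in opts:
        # pdftk concatenates the cover and the body into the final PDF.
        commands.append(["pdftk", cover_pdf, body_pdf, "cat", "output", pdf_path])
    return commands
```

Each returned list could then be handed to subprocess.run(cmd, check=True). Without a cover option, only a single wkhtmltopdf command is produced and it writes the final PDF directly.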

If the make_html option is defined, the HTML output is kept (it is always generated). Note that the HTML output does not include a cover page.

