Transforming plain text to HTML for EPUB 3
This post summarizes the plain text formats that I've experimented with in the SEED.html app to create EPUB 3 books. I'll present a minimal goal HTML fragment, then show for each format what the plain text source looks like and the html fragment it generates.
This is a list of the formats with links to various javascript implementations of parsers/converters that I've worked with in the SEED.html app.
- Markdown (libraries include MarkdownIt, Showdown.js, CommonMark.js, Marked, Kramed, and using a custom ixml grammar via grammix)
- Asciidoc (using asciidoctor.js)
- Textile (see textile-lang.com)
- Org mode (using org.js)
- LaTeX (using latex.js)
- Screenplays (using fountain.io)
The javascript libraries all run in the SEED.html app to convert plain text to html fragments. There is some further tweaking to turn these fragments into chapter XHTML content.
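As a rough sketch of that tweaking step (not the SEED.html code, which this post doesn't show), a fragment can be wrapped into an XHTML content document along these lines. The names wrapChapter, closeVoidElements and escapeXml are made up for illustration, and only a handful of void elements are handled.

```javascript
// Hypothetical helpers, for illustration only.

// Escape text so it is safe inside an XML element such as <title>.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// XHTML requires void elements to be self-closed: <br> becomes <br/>.
// Assumption: only these four elements appear, and attribute values
// never contain a ">" character.
function closeVoidElements(fragment) {
  return fragment.replace(/<(br|hr|img|wbr)\b([^>]*?)\s*\/?>/g, "<$1$2/>");
}

// Wrap a converter's html fragment in a minimal EPUB 3 content document.
function wrapChapter(title, fragment, lang = "en") {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<!DOCTYPE html>",
    `<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="${lang}" xml:lang="${lang}">`,
    "<head>",
    `<title>${escapeXml(title)}</title>`,
    "</head>",
    "<body>",
    closeVoidElements(fragment.trim()),
    "</body>",
    "</html>",
  ].join("\n");
}
```

Calling wrapChapter("Chapter 1", fragment) returns the chapter as a string, ready to be written into the EPUB package.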
Previous posts
I've blogged previously about some of these setups, and there are a few more samples that you can see on the SEED.html app page. The others I haven't spent much time with, either because they're not a good fit for producing accessible EPUB materials or they don't meet my stated goal of being a simple general format for source content (Fountain and LaTeX respectively). They might be relevant for other types of EPUB content, such as MathML, that have not been my focus.
- Markdown using an Invisible XML grammar
- Markdown using MarkdownIt
- Textile for Album Liner Notes
- Fountain for screenplays
- LaTeX sample because it was possible
The SEED.html app home page has additional sample EPUB projects that you can view and edit:
- Introduction to Asciidoc sample using asciidoctor.js
- Introduction to Org Mode sample using org.js
Format / Output Comparison
This is the goal output html fragment illustrating basic XHTML content for an EPUB chapter: headings, paragraphs, inline semantic emphasis and strong tags, and an unordered list.
<h1>Heading 1</h1>
<p>First paragraph text with <em>emphasis</em>.</p>
<h2>Heading 2</h2>
<p>Second paragraph text with <strong>strong</strong>.</p>
<ul>
<li>item one</li>
<li>item two</li>
</ul>
The following sections show the various javascript libraries' plain text input and processed html output to give a flavour of what is possible with the SEED.html app.
Markdown
Here's the basic form of the plain text as markdown.
# Heading 1
First paragraph text with _emphasis_.
## Heading 2
Second paragraph text with **strong**.
* item one
* item two
Here's the output html fragment produced by the
showdown.js parser. It generates nice, clean markup.
Note that the heading elements acquire id attributes that
are slugs generated from the text content.
<h1 id="heading1">Heading 1</h1>
<p>First paragraph text with <em>emphasis</em>.</p>
<h2 id="heading2">Heading 2</h2>
<p>Second paragraph text with <strong>strong</strong>.</p>
<ul>
<li>item one</li>
<li>item two</li>
</ul>
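To make the heading ids concrete, here is a minimal sketch of deriving an id from heading text. It approximates the ids in the output above but is not showdown.js code; makeSlugger, the "section" fallback and the numeric suffix for repeated headings are all assumptions of mine.

```javascript
// Hypothetical slug generator, for illustration only.
// Lowercases the heading text and drops everything except a-z and 0-9,
// so "Heading 1" becomes "heading1".
function makeSlugger() {
  const seen = new Map(); // slug -> number of times already issued

  return function slug(headingText) {
    const base =
      headingText.toLowerCase().replace(/[^a-z0-9]/g, "") || "section";
    const count = seen.get(base) || 0;
    seen.set(base, count + 1);
    // ids must be unique within a document, so repeats get a suffix.
    return count === 0 ? base : `${base}-${count}`;
  };
}
```

One slugger per chapter keeps ids unique within that chapter's XHTML file, which matters if the navigation document links to them.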
I've mostly been using MarkdownIt rather than
showdown.js because its plugin architecture lets me write custom
handlers for non-standard formats. The custom guillemets delimiter handling
that I describe in the
Language Shift post
was implemented as a MarkdownIt plugin.
Textile
Textile is a markup language (like Markdown) for formatting text in a blog or a content management system (CMS).
Out of the box it lets the author add id/class attributes. This
example puts an id of #jump-here on the
h2 element.
h1. Heading 1
First paragraph with _emphasis_.
h2(#jump-here). Heading 2
Second paragraph with *strong*.
* item one
* item two
The html fragment is equally clean, and Textile has a couple of additional out-of-the-box features that I've found useful in authoring EPUBs.
Textile makes available em, i,
strong and b. (markdown only provides
em and strong.) This flexibility is of value when
looking in detail at EPUB accessibility, for example in read-aloud scenarios.
<h1>Heading 1</h1>
<p>First paragraph with <em>emphasis</em>.</p>
<h2 id="jump-here">Heading 2</h2>
<p>Second paragraph with <strong>strong</strong>.</p>
<ul>
<li>item one</li>
<li>item two</li>
</ul>
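The id/class syntax on block signatures is simple enough to sketch. The following is a toy parser for lines like h2(#jump-here). covering only headings and p; it is not how any Textile library works internally, it ignores inline markup, and the function names are invented.

```javascript
// Hypothetical parser for a Textile-style block signature, e.g.
//   "h2(#jump-here). Heading 2"  or  "p(note#intro). Some text"
// Assumption: only h1-h6 and p signatures, with an optional (class#id).
function parseBlockSignature(line) {
  const match =
    /^(h[1-6]|p)(?:\(([^)#]*)(?:#([^)]+))?\))?\.\s+(.*)$/.exec(line);
  if (!match) return null;
  const [, tag, className, id, text] = match;
  return { tag, className: className || null, id: id || null, text };
}

// Render one line as a block element; lines without a signature
// fall back to a plain paragraph. Inline markup is left untouched.
function renderBlock(line) {
  const block = parseBlockSignature(line);
  if (!block) return `<p>${line}</p>`;
  const attrs =
    (block.id ? ` id="${block.id}"` : "") +
    (block.className ? ` class="${block.className}"` : "");
  return `<${block.tag}${attrs}>${block.text}</${block.tag}>`;
}
```

With this sketch, renderBlock("h2(#jump-here). Heading 2") yields the same h2 element as the Textile output above.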
Asciidoc
AsciiDoc is a plain text markup language for writing technical content.
Asciidoctor.js is the Ruby implementation translated into javascript using Opal. There are a lot of features, and the html fragment gets a lot of structural markup that's not necessarily a good fit for accessible EPUB.
= Heading 1
First paragraph text with _emphasis_.
== Heading 2
Second paragraph text with *strong*.
* item one
* item two
The generated HTML tends to have div wrappers where EPUB wants
clean paragraphs. Some of this can be configured, but I found myself fighting
unwanted structural markup.
Like showdown and textile it usefully adds id attributes on
headings, which can be important for EPUB navigation and table of contents.
<h1 id="id-heading-1" class="sect0">Heading 1</h1>
<p>First paragraph text with <em>emphasis</em>.</p>
<div class="sect1">
<h2 id="id-heading-2">Heading 2</h2>
<div class="sectionbody">
<p>Second paragraph text with <strong>strong</strong>.</p>
<div class="ulist">
<ul>
<li>
<p>item one</p>
</li>
<li>
<p>item two</p>
</li>
</ul>
</div>
</div>
</div>
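One way to tame the wrappers is to post-process the fragment. This sketch uses regular expressions because it has to run without a DOM; in the browser a DOM-based approach would be sturdier. flattenSections is a hypothetical name, and the code assumes every div in the fragment is a disposable wrapper, which holds for the sample above but not for Asciidoctor output in general.

```javascript
// Hypothetical post-processor, for illustration only.
// Assumption: all <div> elements are structural wrappers that can go.
function flattenSections(html) {
  return html
    // Drop <div ...> and </div> tags (and trailing whitespace),
    // keeping whatever they contained.
    .replace(/<\/?div\b[^>]*>\s*/g, "")
    // Unwrap an <li> whose only child is a single <p>.
    // List items holding several paragraphs are left alone.
    .replace(
      /<li>\s*<p>((?:(?!<\/?p\b)[\s\S])*?)<\/p>\s*<\/li>/g,
      "<li>$1</li>"
    )
    .trim();
}
```

Run on the fragment above, this keeps the headings and their ids and reduces the list to plain li elements, close to the goal fragment.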
LaTeX
LaTeX is a document preparation system used for the communication and publication of scientific documents.
It's an odd fit for the goal of plain text source because the plain text is quite heavy with formatting instructions.
\documentclass{book}
\begin{document}
\chapter*{Heading 1}
First paragraph text with \emph{emphasis}.
\section*{Heading 2}
Second paragraph text with \bfseries{strong}.
\begin{itemize}
\item item one
\item item two
\end{itemize}
\end{document}
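The generated markup has undesirable spans for inline emphasis/strong and the unordered list structure is a bit of a mess. The stray second bf span around the full stop is likely down to my source: \bfseries is a declaration that stays switched on rather than a command that takes an argument, so \textbf{strong} would be the usual way to bold a single word. The library is more interesting to me for its drawing capabilities than as a top-level html fragment generator. I will write more about this in another post.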
<h1>Heading 1</h1>
<p>First paragraph text with <span class="it">emphasis</span>.</p>
<h2>Heading 2</h2>
<p>Second paragraph text with <span class="bf">strong</span><span class="bf">.</span></p>
<ul class="list">
<li>
<span class="itemlabel"><span class="hbox llap">•</span></span>
<p>item one</p>
</li>
<li>
<span class="itemlabel"><span class="hbox llap">•</span></span>
<p>item two</p>
</li>
</ul>
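If the spans are the main obstacle, they can be mapped back to semantic tags after conversion. This is only a sketch: spansToSemantic and its default class map are assumptions based on the class names in the output above, and it only touches spans whose content is plain text.

```javascript
// Hypothetical class-to-tag mapping, taken from the sample output.
const DEFAULT_CLASS_MAP = new Map([
  ["it", "em"],
  ["bf", "strong"],
]);

// Replace <span class="X">text</span> with a semantic element when X
// is in the map. Spans with nested markup or unknown classes are
// left exactly as they were.
function spansToSemantic(html, classMap = DEFAULT_CLASS_MAP) {
  return html.replace(
    /<span class="([^"]+)">([^<]*)<\/span>/g,
    (whole, className, text) => {
      const tag = classMap.get(className);
      return tag ? `<${tag}>${text}</${tag}>` : whole;
    }
  );
}
```

Passing a different map, for example one keyed on italic and bold, would cover converters that use other class names. Whether italics should always become em is an editorial decision that a blanket mapping like this glosses over.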
Org Mode
Org Mode is an authoring tool and a TODO list manager for GNU Emacs.
org.js is a parser and converter for org-mode notation.
There's some messing around with title here to force an
h1 tag in the output, but for a chapter I'd be happy with h2 as
the chapter title.
#+title: Header 1
First paragraph text with /emphasis/.
** Header 2
Second paragraph text with *strong*.
- item one
- item two
The generated markup is clean, but it uses presentational i and
b tags rather than semantic em and
strong.
I like it and I'm leaving it on a list of formats to explore when I need more structured output than markdown or textile provide.
<h1>Header 1</h1>
<p>First paragraph text with <i>emphasis</i>.</p>
<h2 id="header-0-1"><span class="section-number">0.1</span>Header 2</h2>
<p>Second paragraph text with <b>strong</b>.</p>
<ul>
<li>item one</li>
<li>item two</li>
</ul>
Fountain
Fountain is a plain text markup language for screenwriting.
I find it interesting as a domain-specific plain text format that almost entirely does away with markup.
The example input here is contrived to render the goal output. This is not what the format is designed for. It's designed for screenplays and does a phenomenal job at that task.
There's a fountain sample epub on the SEED.html home page that uses it in a more idiomatic way to create an EPUB format screenplay. Check that out.
Title:
Heading 1
First paragraph text with *emphasis*.
>Heading 2
Second paragraph text with **strong**.
<ul><li>item one </li><li>item two</li></ul>
Here's the generated html fragment. Inline emphasis and strong come out as spans with class names rather than semantic tags.
<h1>Heading 1</h1>
<p>First paragraph text with <span class="italic">emphasis</span>.</p>
<h2 id="heading-smash-cut-to">Heading 2</h2>
<p>Second paragraph text with <span class="bold">strong</span>.</p>
<p></p>
<ul>
<li>item one </li>
<li>item two</li>
</ul>
Future directions
One direction is writing Invisible XML grammars and parsing/converting sources using Grammix. This seems really promising for encapsulating a grammar for custom markup within the EPUB itself.
I'm imagining a custom ixml grammar that takes textile as a starting point: block elements have an opening delimiter with a tag name and id/class, plus some conventions for defining inline elements. It's kind of riffing on the 'language shift' idea described in an earlier post.
Next up: other plain text formats for special purposes
In a future post I'll show the workings for my experiments around processing plain text representations for diagrams, music and custom transforms of code blocks in EPUB 3.
- ABC music notation (using abcjs and abc2svg)
- Syntax highlighting (using highlight.js and prism.js)
- SVG Diagrams (using d3.js, mermaid.js and latex.js)