pandoc

An itemised list of the esoteric difficulties involved in bullet points

July 17, 2019 — September 5, 2024

academe
computers are awful
faster pussycat
lua
plain text
UI
workflow
Figure 1: pandoc (centre) amongst the markup formats: markdown (bottom right), MS Office (bottom left), HTML (top left), LaTeX (top right) etc.

Pandoc is, IMO, the default parser for markdown and a Swiss army knife for converting between other text markup formats.

Various useful things are built on pandoc, including blogdown, quarto etc.

1 Installing pandoc

I install pandoc via homebrew. If you are using RStudio, you already have an installation inside RStudio. You can access that installation by letting your shell know about the path to it. On macOS this looks like

export PATH=$PATH:/Applications/RStudio.app/Contents/MacOS/pandoc

conda, the python package manager, will obediently install it also. The default version was ancient last time I checked, though. Consider using the conda-forge version:

conda install -c conda-forge pandoc

You can also install it via, e.g., a Linux package manager, but this is not recommended: the packaged version tends to be even more elderly, and the improvements in recent pandoc versions are great. You could also compile it from source, but this is laborious because it is written in Haskell, a semi-obscure language with hefty installation requirements of its own. There are probably other options, but I don’t know them.

pandoc pro tips

John MacFarlane’s pandoc tricks are the canonical tricks, as John MacFarlane is the boss of pandoc.

2 Converting

The whole thing about pandoc is that it can convert between a lot of formats. See the list of output formats in the pandoc documentation, or run

pandoc --list-input-formats
pandoc --list-output-formats
pandoc --list-extensions

I thought pandoc would be idempotent, in the sense that if I convert markdown to markdown it should come out more-or-less the same. It is not at all idempotent. Smart quotes are altered, list markup is changed, header blocks are munged. As such, pandoc cannot really modify stuff “in place”.
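One way to see how far a markdown-to-markdown round trip drifts is to diff the input against the output. The following is a minimal sketch in Python, assuming pandoc is on the PATH; the helper names pandoc_md_to_md and roundtrip_diff are invented for this illustration and are not part of pandoc.

```python
import difflib
import subprocess


def pandoc_md_to_md(text):
    """Pipe text through `pandoc -f markdown -t markdown`.

    Assumption: a pandoc executable is available on the PATH.
    """
    result = subprocess.run(
        ["pandoc", "-f", "markdown", "-t", "markdown"],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def roundtrip_diff(text, convert=pandoc_md_to_md):
    """Return unified-diff lines between text and convert(text).

    An empty list means the conversion left the text unchanged.
    """
    converted = convert(text)
    return list(
        difflib.unified_diff(
            text.splitlines(),
            converted.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )


# Example use (needs pandoc installed):
#   for line in roundtrip_diff('* an item with "quotes"\n'):
#       print(line)
```

Passing the converter as an argument keeps the diffing logic separate from the pandoc call, so other format pairs can be checked by swapping in a different function.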

3 Document metadata

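Use YAML metadata blocks: a block of YAML fenced by lines of three hyphens, usually at the top of the document, holding fields such as title, author and date.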

4 Headers and macros

You want fancy mathematical macros, or a latex preamble? Something more elaborate still?

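Modify a template to include a custom preamble, e.g. for a custom document type. Here’s how you change that globally (recent pandoc versions prefer ~/.local/share/pandoc as the user data directory and use ~/.pandoc only if the former does not exist):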

pandoc -D latex > ~/.pandoc/templates/default.latex

Or locally:

pandoc -D latex > template.latex
pandoc --template=template.latex …

If you only want some basic macros, altering the document type is probably overkill. Simply prepend a header file:

pandoc -H _macros.tex chapter_splitting.md -o chapter_splitting.pdf

NB: Pandoc will expand basic LaTeX macros all by itself, even in HTML output.

There are many other pandoc template tricks, which end up being important in fancy pandoc-based systems like quarto.

5 Cross references and citations

As discussed also in my citation guide, I use pandoc-citeproc. See also the relevant bit of the pandoc manual.

Cross references are supported by pandoc-crossref or some combination of pandoc-fignos, pandoc-eqnos etc.

You invoke that with the following flags. Order is important: pandoc-crossref must come before the citation processor.

pandoc -F pandoc-crossref -F pandoc-citeproc file.md -o file.html

The resulting syntax in markdown is

$$ x^2 $$ {#eq:label}

for labels and, for references,

@fig:label
@eq:label
@tbl:label

or

[@fig:label1;@fig:label2;…]
[@eq:label1;@eq:label2;…]
[@tbl:label1;@tbl:label2;…]

etc.

RMarkdown, while still using pandoc AFAICT, does this slightly differently:

See equation \@ref(eq:linear)

\begin{equation}
a + bx = c  (\#eq:linear)
\end{equation}

quarto does it differently again.

Citations can either be rendered by pandoc itself or passed through to some BibTeX nightmare, if you feel that diacritics and other non-English typography are an insidious plot by malevolent agencies.

Citekeys per default look like BibTeX, and indeed BibTeX citations seem to pass through.

\cite{heyns_foo_2014,heyns_bar_2015}

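They are rendered in the output by citeproc, which was a separately installed filter (pandoc-citeproc) in older versions and has been built into pandoc, as the --citeproc option, since version 2.11.

The preferred citeproc format seems to be something with an @ sign and/or occasional square brackets: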

Blah blah [see @heyns_foo_2014, pp. 33-35; also @heyns_bar_2015, ch. 1].
But @heyns_baz_2016 says different things again.

This is how you output it.

# Using the CSL transform

pandoc -F pandoc-citeproc --csl=apa.csl --bibliography=bibliography.bib \
    -o document.pdf document.md
# or using biblatex and the traditionalist workflow.

pandoc --biblatex --bibliography=bibliography.bib \
    -o document.tex document.md
latexmk document

If you want your reference section numbered, you need some magic:

## References

::: {#refs}
:::

Aside: CSL is close to being good for use on websites, but has a flaw: it does not support links, in the sense that there is no general way in the standard to tell a CSL renderer where to put links. There is a hack that may support your use case, although it is not ideal for mine. This is not the same as saying links are impossible; rather, it means that if you want something different you need to write your own CSL processor with some idiosyncratic URL handling built in, which presupposes that you have access to the source code of whatever tool you use and would like to spend time maintaining a fork of it. Fundamentally, the creators of this tool imagine that we are only using it for writing stuff to be printed out on paper.

6 Tables

Too many types. I usually find the pipe tables easiest since they don’t need me to align text. They look like this:

| Right | Left | Default | Centre |
|------:|:-----|---------|:------:|
|   12  |  12  |    12   |    12  |
|  123  |  123 |   123   |   123  |
|    1  |    1 |     1   |     1  |

However, they do not support equations that contain pipes, or at least do so unreliably for me. EDIT: no, I think the issue is something other than parsing pipes. I give up in confusion.
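Because pipe tables need no column alignment, they are also easy to generate from a script. Here is a minimal sketch in Python; the function name pipe_table and its arguments are invented for illustration, and it assumes that no cell contains a pipe character.

```python
def pipe_table(headers, rows, align=None):
    """Render headers and rows as a pipe table.

    align holds one entry per column: "l", "r", "c", or "" for the default.
    Assumption: cell contents contain no "|" characters (no escaping is done).
    """
    markers = {"l": ":---", "r": "---:", "c": ":---:", "": "---"}
    align = align or [""] * len(headers)

    def line(cells):
        # Cells are separated by pipes; padding for alignment is optional.
        return "| " + " | ".join(str(cell) for cell in cells) + " |"

    out = [line(headers), line(markers[a] for a in align)]
    out.extend(line(row) for row in rows)
    return "\n".join(out)


print(pipe_table(["Right", "Left"], [[12, 12], [123, 123]], ["r", "l"]))
```

The second line of the output carries the alignment: a colon on the right means right-aligned, on the left means left-aligned, on both sides means centred.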

7 Figures, algorithms, etc

panflute-filters collates a bunch of useful filters:

pandoc-figures: figures with captions and backmatter support
pandoc-tables: tables with captions, backmatter support, csv support
pandoc-algorithms: support for tex algorithm packages
pandoc-tex: replace arbitrary tex templates

If we would like to include cool vector graphics formats we might use Pandoc Kroki Filter to invoke kroki or d2-filter to invoke D2.

8 DIY filters and extensions

Filters can be written in Haskell, or in lua, which runs in an interpreter embedded in pandoc.

The intermediate representation can be serialised to JSON so we can use any language that handles JSON, if we are especially passionate about some other language e.g. python, or some JSON-specific hack. SDKs for other languages than lua and haskell are based upon the JSON intermediate format.
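To make the JSON route concrete, here is a minimal sketch of a filter in plain Python, using only the standard library. It relies on the shape of pandoc's JSON output, in which each element is an object with a "t" (type) key and usually a "c" (contents) key; the function names walk, shout and main are invented for this example.

```python
import json
import sys


def walk(node, action):
    """Recursively apply action to every AST element (a dict with a "t" key)."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            node = action(node)
        return {key: walk(value, action) for key, value in node.items()}
    return node


def shout(elem):
    """Upper-case the text of Str elements; leave everything else alone."""
    if elem["t"] == "Str":
        return {"t": "Str", "c": elem["c"].upper()}
    return elem


def main():
    """Read a JSON document from stdin and write the transformed one to stdout."""
    doc = json.load(sys.stdin)
    json.dump(walk(doc, shout), sys.stdout)


# To use this as a filter, add a final line that calls main().
```

Saved as, say, shout.py (a hypothetical file name) with the call to main() added, this can sit in a pipeline such as pandoc -t json doc.md | python shout.py | pandoc -f json -t html, or be passed to pandoc via --filter if made executable.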

The lua-flavoured filters seem easy, natural and fast and possibly the most popular, forming major infrastructure for, for example, the document systems quarto and living papers.

Don’t know lua? Pro tip: ChatGPT is really good at lua.

9 Conversion tricks

9.1 Basic markdown

For example, markdown to LaTeX:

pandoc -f markdown -t latex -o document.tex document.md

or from the clipboard:

fish_clipboard_paste | pandoc -f markdown -t latex | fish_clipboard_copy

Typically I want it to handle maths without mangling, so I use some flags

fish_clipboard_paste | pandoc -f markdown+tex_math_single_backslash -t latex | fish_clipboard_copy

9.2 Presentations

Presentation output is invoked magically by various scientific workbooks which support presentation backends. I do not know the details.

9.3 reStructuredText to Markdown

Pandoc’s reStructuredText reader exists but is not great. One option for better results is to go via HTML, e.g.

rst2html.py --math-output=MathJax document.rst | pandoc -f html -t markdown -

This will mangle the mathematical equations.

Or, this will mangle links and headings:

pandoc -f rst -t markdown document.rst

If they are non-trivial documents, I would try ReST-specific converters.