Digital scientific workbooks

Literate coding for reality

November 17, 2014 — September 9, 2024

academe
computers are awful
faster pussycat
how do science
information provenance
plain text
premature optimization
UI
workflow
Figure 1

The exploratory-algorithm-person’s IDE-equivalent. Literate coding meets science, a.k.a. dynamic report generation, a.k.a. literate programming.

Let’s say I want to demonstrate my algorithm to my thesis advisor while he’s off at a conference. I need an easily shareable demonstration. That’s why we have the internet, right? I should be able to interleave text, mathematics, and also code demonstrating the thingy, maybe even some graphs of the output. It should be in a simple format that one can use, execute, and edit as easily as possible. That is what a scientific workbook does; it takes the text and renders all the graphs, tables, and other experimental output as a nicely formatted document. Reproducing the document requires pressing a single button, not laboriously manually executing some inscrutable code snippets, or a bloody spreadsheet.
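To make the "press a single button" idea concrete, here is a minimal sketch of what a workbook engine does at its core: walk through a document, execute each code chunk in a shared session, and splice the printed output back into the text. The function name `weave`, the `#>` output prefix, and the choice of noweb-style `<<>>=` / `@` chunk delimiters are my own assumptions for illustration; real tools also handle graphics, caching, chunk options and errors.

```python
import contextlib
import io


def weave(source: str) -> str:
    """Run each noweb-style chunk and splice its printed output into the text."""
    namespace = {}   # shared by all chunks, like a single notebook session
    out_lines = []
    chunk = None     # None while reading prose; a list of code lines inside a chunk
    for line in source.splitlines():
        if chunk is None:
            if line.strip() == "<<>>=":
                chunk = []               # a code chunk starts here
            else:
                out_lines.append(line)   # prose passes through unchanged
        elif line.strip() == "@":
            # End of chunk: execute it and capture whatever it prints.
            buffer = io.StringIO()
            with contextlib.redirect_stdout(buffer):
                exec("\n".join(chunk), namespace)
            out_lines.extend("    " + c for c in chunk)
            out_lines.extend("    #> " + o for o in buffer.getvalue().splitlines())
            chunk = None
        else:
            chunk.append(line)
    return "\n".join(out_lines)


document = """\
The mean of the first ten squares:
<<>>=
squares = [n * n for n in range(1, 11)]
print(sum(squares) / len(squares))
@
Later chunks can reuse earlier results.
<<>>=
print(max(squares))
@
"""

report = weave(document)
print(report)
```

Because every chunk runs in the same namespace, the second chunk can see `squares` from the first, which is the property that makes a workbook feel like one continuous session rather than a pile of disconnected snippets.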

Everyone wants to make this better, but there are coordination problems, standards problems, and inertia.

See Rethinking ML Papers for some recent advances.

1 Philosophy

Why do this? To belatedly immanentize the prophecy that the scientific paper is dead. As part of the reproducible/open science process.

Yihui Xie puts it practically: his Notebook war summarises some philosophical and practical differences between the literate coding/exploratory notebook hybrid tools in use, with an eye to application. Workbook systems are also behind projects like Nextjournal, a collaborative coding machine that claims to make it easy for you and your colleagues to write in a workbook style together. Go and read Explorable explanations for some philosophy, or explorabl.es for some hands-on experiments. A welcoming example is Nicky Case’s loopy.

Welcome to Explorable Explanations, a hub for learning through play! We’re a disorganized “movement” of artists, coders & educators who want to reunite play and learning.

In fact, most of us just run code most of the time; pragmatic tools exist to turn that into a real workflow, which I call experiment tracking.

Jeremy Kun on UI for mathematics discusses one of the problems that scientific workbooks are implicitly attempting to solve.

Lots of people struggle with math, and a better user interface for mathematics would immediately usher in a new age of enlightenment. This isn’t an idle speculation. It has happened time and time again throughout history. The Persian mathematician Muhammad ibn Musa al-Khwarizmi invented algebra (though without the symbols for it) which revolutionized mathematics, elevating it above arithmetic and classical geometry, quickly scaling the globe. Make no mistake, the invention of algebra literally enabled average people to do contemporarily advanced mathematics.… Shortly after the printing press was invented French mathematicians invented modern symbolic notation for algebra, allowing mathematics to scale up in complexity. Symbolic algebra was a new user interface that birthed countless new thoughts. Without this, for example, mathematicians would never have discovered the connections between algebra and geometry that are so prevalent in modern mathematics and which lay the foundation of modern physics. Later came the invention of set theory, and shortly after category theory, which were each new and improved user interfaces that allowed mathematicians to express deeper, more unified, and more nuanced ideas than was previously possible.…

In his book “The Art of Doing Science and Engineering,” the mathematician and computer scientist Richard Hamming put this difficulty into words quite nicely,

It has rarely proved practical to produce exactly the same product by machines as we produced by hand. Indeed, one of the major items in the conversion from hand to machine production is the imaginative redesign of an equivalent product. Thus in thinking of mechanising a large organization, it won’t work if you try to keep things in detail exactly the same, rather there must be a larger give-and-take if there is to be a significant success. You must get the essentials of the job in mind and then design the mechanization to do that job rather than trying to mechanize the current version—if you want a significant success in the long run.

Hamming’s attitude about an “equivalent product” summarizes the frustration of writing software. What customers want differs from what they say they want. Automating manual human processes requires arduously encoding the loose judgments made by humans—often inconsistent and based on folklore and experience. Software almost always falls short of really solving your problem. Accommodating the shortcomings requires a whole extra layer of process.

… My imagination may thus defeat itself by failing to give any ground. If a new interface is to replace pencil and paper mathematics, must I give up the ease of some routine mathematical tasks? Or remove them from my thinking style entirely? …

Mathematics succeeds only insofar as it advances human understanding. Pencil and paper may be the wrong tool for the next generation of great thinkers. But if we hope to enable future insights, we must understand how and why the existing tools facilitated the great ideas of the past. We must imbue the best features of history into whatever we build. If you, dear programmer, want to build those tools, I hope you will incorporate the lessons and insights of mathematics.

2 Sharing

mybinder, the flagship instance of BinderHub, hosts diverse workbooks in the cloud using containers.

BinderHub is a cloud-based technology that can launch a repository of code (from GitHub, GitLab, and others) in a browser window such that the code can be executed and interacted with. A unique URL is generated allowing the interactive code to be easily shared.
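As a rough sketch of what that "unique URL" looks like for a GitHub repository, the helper below assembles a launch link. The function name `binder_url` is hypothetical, and the `/v2/gh/<org>/<repo>/<ref>` layout is my understanding of the pattern used by the public mybinder.org service; check the BinderHub documentation before relying on it, especially for other providers such as GitLab.

```python
from urllib.parse import quote


def binder_url(org: str, repo: str, ref: str = "HEAD",
               host: str = "https://mybinder.org") -> str:
    """Build a Binder launch link for a GitHub repository.

    The URL layout is an assumption based on mybinder.org's public links.
    """
    # Percent-encode each path segment; safe="" also encodes any "/" in a branch name.
    parts = [quote(part, safe="") for part in (org, repo, ref)]
    return f"{host}/v2/gh/" + "/".join(parts)


url = binder_url("binder-examples", "requirements")
print(url)
```

Anyone who opens such a link should get their own temporary container built from the repository, so the demonstration can be shared without the reader installing anything.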

Hypergraph is a vaunted new experiment-and-analysis tracking system which promises some collaborative tools. I have not yet tried it.

3 Quarto

See quarto.

4 Living papers

Figure 2

Living Papers: A Language Toolkit for Augmented Scholarly Communication:

We contribute Living Papers, a framework for writing enhanced articles that encompass multiple output types: interactive web pages to enable augmented reading experiences, accessibility, and self-publishing; static PDFs to align incentives and participate in existing publishing workflows; and application programming interfaces (APIs) to enable easy extraction and reuse of both article content and executable code. In sum, Living Papers is a “language toolkit” consisting of a standardized document model and a set of extensible parsers, transforms, and output generators.

To support dynamic reading aids and explorable explanations [62], Living Papers produces web-based articles with a reactive runtime and extensible component system. We use Markdown [20] as a default input format, with syntax extensions for custom components. Articles may include executable code in languages such as JavaScript, R, and Python to generate static or interactive content. To support “backwards compatibility” with current publishing practices, the Living Papers compiler automatically converts interactive and web-based material to static content, and generates LaTeX [36] projects or compiled PDFs using extensible journal and conference templates. To assist not only people but also computers to more easily interpret papers, Living Papers can compile article content into accessible data structures, APIs, and software modules.

It seems to be somewhere between quarto and observablejs in the world of scientific workbooks. In particular, it seems (?) not to require a server to provide advanced data analysis, because it compiles down to the browser execution engine (?), which is cool, but also probably has some irritating restrictions; my own experience of getting data analysis to work in the browser has been frustrating.

5 Glamorous Toolkit

Glamorous Toolkit

Glamorous Toolkit is the moldable development environment. It is a live notebook. It is a flexible search interface. It is a fancy code editor. It is a software analysis platform. It is a data visualization engine. All in one. And it is free and open-source under an MIT license.

Maybe this is closer to a data dashboard?

6 Deepnote

Deepnote

Deepnote is a new kind of data science notebook. Jupyter-compatible with real-time collaboration and running in the cloud. Oh, and it’s free.

Hmm.

7 Jupyter

jupyter is weird enough to have its own notebook. It is somewhat python-centric but pretty good with multiple languages and AFAICT secretly used inside e.g. RStudio.

However, jupyter had a good 10 years to justify itself to me and failed, and I will not be coming back to it until forced.

8 codebraid

Jupyter competitor codebraid is a literate programming tool.

9 Pweave

Pweave, by Matti Pastell, is the python twin to knitr, in the lineage of literate coding tools. That is to say, it does less interactive notebook stuff and more straight-up report generation stuff.

Pweave is a scientific report generator and a literate programming tool for Python. It can capture the results and plots from data analysis and works well with numpy, scipy and matplotlib.

Max Masnick gives a detailed set up example.
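For flavour, here is a small sketch of a script in what I understand to be Pweave's script-publishing format, where lines beginning with `#'` are treated as markdown prose and a `#+` line carries chunk options. The data and variable names are made up for illustration, and only the standard library is used so that the file also runs as ordinary Python; the marker syntax should be checked against the Pweave documentation.

```python
#' # Sample mean report
#'
#' Lines starting with `#'` are read as markdown prose by Pweave's script
#' format (as I understand it); everything else is ordinary Python.

import statistics

# Hypothetical measurements, invented for this example.
samples = [2.1, 2.4, 1.9, 2.6, 2.0]

#' The sample mean and standard deviation are computed below.

#+ term=True
mean = statistics.mean(samples)
stdev = statistics.stdev(samples)
print(f"mean={mean:.2f} stdev={stdev:.2f}")
```

Since the markup lives in comments, the same file works both as a plain script and as report source; Pweave's `pypublish` command is, as far as I know, the tool that turns such a script into HTML or PDF.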

10 Weave.jl

The Julia twin to Pweave or knitr is Weave.jl. See also Literate.jl, which looks similar.

The code chunk will be run with default options and the output captured.
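<<>>=
using Gadfly
# linspace was removed in Julia 1.0; range with 50 points matches its old default
x = range(0, stop = 2 * pi, length = 50)
println(x)
# broadcast sin over the range with the dot syntax
plot(x = x, y = sin.(x))
@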

Or you could use RMarkdown/knitr in julia mode. It’s not yet clear to me how graphing works in that case. Changcheng Li claims: “easily”.

11 Tangle

Tangle did well but appears to be little-maintained now.

12 Stencila

TBD: stencila is a GUI for reproducible research, which essentially is a GUI for knitr, like jupyter but for statisticians rather than computer scientists. It is best understood via an example, or the announcement.

Hosted. USD39/month.

13 knitr/RMarkdown

See knitr/RMarkdown.

14 MATLAB

15 Editor/IDE support

Miscellaneous preview support scripts are given in the knitr documentation.

15.1 VS Code

Many. TBD.

16 References

Granger and Pérez. 2021. “Jupyter: Thinking and Storytelling With Code and Data.” Computing in Science & Engineering.
Heer, Conlen, Devireddy, et al. 2023. “Living Papers: A Language Toolkit for Augmented Scholarly Communication.” In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. UIST ’23.
Jirotka, Lee, and Olson. 2013. “Supporting Scientific Collaboration: Methods, Tools and Concepts.” Computer Supported Cooperative Work (CSCW).
Lau, Drosos, Markel, et al. 2020. “The Design Space of Computational Notebooks: An Analysis of 60 Systems in Academia and Industry.” In 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC).
Poore. 2019. “Codebraid: Live Code in Pandoc Markdown.” Proceedings of the 18th Python in Science Conference.
Sokol and Flach. 2021. “You Only Write Thrice: Creating Documents, Computational Notebooks and Presentations From a Single Source.” In.