The Jupyter Cinematic Universe

A constellation of somewhat-compatible technologies from which we can extract a compromise between 1) actually doing data science, 2) seeming to laypeople to be doing data science, and 3) placating begrudging IT support

February 9, 2017 — November 4, 2024

Tags: faster pussycat, premature optimization, python, UI

The python-derived entrant in the scientific workbook field is called jupyter.

Interactive “notebook” computing for various languages; python/julia/R/whatever plugs into the “kernel” interface. Jupyter allows easy(ish) online-friendly worksheets, which are both interactive and easy to export for static online use. This is handy. Handy enough that it’s sometimes worth the many rough spots, and so I conquer my discomfort and use it.

1 Why jupyter?

What does jupyter buy us? Is it worth the set-up time configuring this contraption?

It took me a long time to realise that part of the answer to the first question is that jupyter is a de facto standard for running remote computation jobs interactively. The browser-based, network-friendly jupyter notebook is a natural, easy way to execute tedious computations on some other computer somewhere else, with some kind of a paper trail. In particular, it is much better over unreliable networks than remote terminals or remote desktops, because the client/server architecture doesn’t need so many round-trips to get the code output back to the user. A good feature of jupyter (maybe its best) is that it amounts to a re-designed network terminal. Certainly, if what we need to do could be executed over either remote desktop or jupyter, jupyter is going to be less awful over laggy network connections, where every mouse click and keystroke involves waiting and twiddling your fingers.
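In practice that workflow looks something like the following (hostnames and port numbers here are hypothetical):

# on the remote machine, start a server without trying to open a browser
jupyter lab --no-browser --port 8888
# on the local machine, forward that port home
ssh -N -L 8888:localhost:8888 me@remote.example.org
# then browse to http://localhost:8888 and paste the token jupyter printed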

What else? People make UX arguments, e.g. that jupyter is friendly and supports interactive plots and so on. I am personally ambivalent about those arguments. Jupyter can do some things better than the console. That artificially restricted comparison should not reassure us; we are not limited to the console. On the other hand, most things that jupyter does, it does worse than a proper IDE or decent code editor. Sometimes those other tools are not available on, say, the local HPC cluster or cloud compute environment, and then this becomes a relevant advantage. Usually though, we can install VS Code Remote, unless we have angered the sysadmins.

But for now the main takeaway, I think, is that if, like me, you are confused by jupyter enthusiasts claiming it is “easy” or “fun”, it may make more sense if you mentally append the proviso “…in comparison to some other horrible thing which I was forced to use by ignorance or circumstance several years ago.”

There are other comparisons to make — some like jupyter as a documentation format/literate coding environment. Once again, sure, it is better than text files. But then, Quarto is more portable, VS Code’s notebook mode plays better with version control, etc.

We can generically answer “Is it worth it?” with “That depends on the alternatives. Jupyter is adequate, and commonly available.”

2 Alternatives

My ambivalence about jupyter leads me to consider when it is worth considering other options for interactive code execution in python.

First, be aware there are many variant front-ends to jupyter, which ameliorate some of the pain points of the jupyter notebook interface, e.g. quarto uses jupyter python kernels, but disregards the jupyter notebook interface in favour of a more traditional document-based interface.
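To make that concrete, here is a minimal sketch of a Quarto document driving a jupyter python kernel (filename and cell contents hypothetical); it renders with quarto render example.qmd:

---
title: "Example"
jupyter: python3
---

Some prose, then an executed cell:

```{python}
import statistics
statistics.mean([1, 2, 3])
```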

But also! There are other interactive python environments which entirely re-imagine the python notebook concept.

2.1 Marimo

marimo is a python-specific notebook alternative which solves many pain points of jupyter (HT Jean-Michel Perraud).

From the marimo FAQ:

marimo solves problems in reproducibility, maintainability, interactivity, reusability, and shareability of notebooks.

Reproducibility. In Jupyter notebooks, the code you see doesn’t necessarily match the outputs on the page or the program state. If you delete a cell, its variables stay in memory, which other cells may still reference; users can execute cells in arbitrary order. This leads to widespread reproducibility issues. One study analysed 10 million Jupyter notebooks and found that 36% of them weren’t reproducible.

In contrast, marimo guarantees that your code, outputs, and program state are consistent, eliminating hidden state and making your notebook reproducible. marimo achieves this by intelligently analysing your code and understanding the relationships between cells, and automatically re-running cells as needed.

Maintainability. marimo notebooks are stored as pure Python programs (.py files). This lets you version them with git; in contrast, Jupyter notebooks are stored as JSON and require extra steps to version.

Interactivity. marimo notebooks come with UI elements that are automatically synchronised with Python (like sliders, dropdowns); eg, scrub a slider and all cells that reference it are automatically re-run with the new value. This is difficult to get working in Jupyter notebooks.

Reusability. marimo notebooks can be executed as Python scripts from the command-line (since they’re stored as .py files). In contrast, this requires extra steps to do for Jupyter, such as copying and pasting the code out or using external frameworks. In the future, we’ll also let you import symbols (functions, classes) defined in a marimo notebook into other Python programs/notebooks, something you can’t easily do with Jupyter.

Shareability. Every marimo notebook can double as an interactive web app, complete with UI elements, which you can serve using the marimo run command. This isn’t possible in Jupyter without substantial extra effort.
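To make the “stored as pure Python” point concrete, here is roughly what a marimo notebook looks like on disk — a sketch from memory, so check marimo’s docs for the current file format. Cell functions declare the names they consume as parameters and return the names they define, which is how marimo infers the dependency graph:

import marimo

app = marimo.App()

@app.cell
def _():
    import marimo as mo
    slider = mo.ui.slider(1, 100, label="n")
    slider  # last expression before return is the cell's displayed output
    return mo, slider

@app.cell
def _(slider):
    # re-runs automatically whenever the slider moves
    total = sum(range(slider.value))
    total
    return (total,)

if __name__ == "__main__":
    app.run()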

2.2 codebraid

codebraid is a “live code”-style reworking of the jupyter notebook concept.

3 Python-specific

Jupyter supports multiple languages, but is itself built in python (and javascript) and there are some python-specific bits. See IPython.

4 Engineering pain points

Figure 2: Jupyter notebook in action

What technical difficulties can we expect when using jupyter?

tl;dr Not to besmirch the efforts of the jupyter developers, who are doing a difficult thing, in many cases for free, but I will complain about the jupyter notebook with the justification that it is best to go into these things with your eyes open. Jupyter is often touted as a wonderful solution for data science which makes stuff generally easier, but it seems to me merely to offer a different selection of pain points from traditional methods, and those pain points are often surprising and novel, which does not make them better.

I’m an equivocal advocate of the jupyter notebook interface, which some days seems to counteract every plus with a minus, but at least is a pretty easy starting point. This is partly due to the particulars of jupyter’s design decisions, and partly because of the problems of notebook interfaces generally (Chattopadhyay et al. 2020). As with so many computer interfaces, my lukewarm endorsement is, in relative terms, fannish enthusiasm because often, as presaged, the alternatives are abysmal.

Jupyter: It’s friendly to use, but hard to install. It’s easy to graphically explore your data, but hard to keep that exploration in version control. It makes it easy to explore my code’s graphical output, but clashes with the fancy debugger that would make it easy to explore my code’s bugs. It is open source, and written in an easy scripting language, python, so it seems it should be easy to tweak to taste. In practice it’s an ill-explained spaghetti of python, javascript, compiled libraries and browsers that relate to one another in obscure ways that few people with a day job have time to understand or contribute to. There have been so many reboots, rewrites, re-architectures, and reorganisations that it’s hard to know what is going on. Each line of development takes place in a separate timeline in an extended cinematic multiverse, wherein the writers’ rooms are occasionally merged, but the timelines are never reconciled.

Things regularly break at either the server or the client side, and I might need to upgrade either or both to fix it. I might have many different installs of each and need to upgrade a half-dozen of them to keep them all working, because jupyter is deeply intertwined with the pain points of python packaging hell, and in many ways makes them worse by multiplying the number of python environments I need to manage. It claims to be extensible, but if I use any extensions it is a constant struggle to keep jupyter finding the many intricate dependencies needed to keep the entire contraption running. The sum total is IMO no easier to run than most of the other UI development messes that we tolerate in academic software.

Case study: a dependency of a dependency of the autocomplete function broke, which spawned a multi-month confusion of cascading problems and cost me several hours to fix across the few dozen python environments I manage across several computers. This kind of tedious intermittent breakage is simply the cost of doing business with jupyter, and has been for as long as I have been using the project, which is as long as it has existed.

These pain points are perhaps not so intrusive for projects of small-to-intermediate complexity and/or longevity. Indeed, jupyter seems good at making quick data science projects look smooth, shiny, and inviting. That is, at the crucial moment when I need to make my data science project look sophisticated-yet-friendly, it lures colleagues into my web(-based IDE). Then it is too late, mwhahahahah, you have fallen into my trap, now you are committed, you had better find budget to maintain this mess. This entrapment might be a feature, not a bug, given the realities of team dynamics and their relation to software development and organisational support. We want to lure people in until our problems become their problems, because a problem shared is a problem divided. Also, shared trauma is a bonding experience.

Some argue that the weird / irritating constraints of jupyter can even lead to good architecture. See Guillaume Chevallier and Jeremy Howard. This sounds like an interactive twist on the old test-driven-development rhetoric. I could be persuaded of its merits, if I found time in between all the debugging.

I think of the famous adage “The fastest code is code that doesn’t need to run, and the best code is code you don’t need to write”. The uncharitable corollary might be “Thus, let’s make writing code horrible so that you write less of it”. That is not even necessarily a crazy position, and if that is what Guillaume and Jeremy are saying, I guess I’ll take it.

Here is some verbiage by Will Crichton which explores some of these themes, The Future of Notebooks: Lessons from JupyterCon.

These rants have all been about running jupyter. If I would like to develop or extend jupyter, I have more pain to deal with. Not recommended unless it is your actual literal job. “How hard could it be to add just one little feature to this friendly little framework?” one might think. Answer: Extremely.

5 Terminology

Pain point: The lexicon of jupyter is confusing. Terminology tarpit alert.

A notebook is, on one hand, a style of interface, of which jupyter implements one interpretation. Other applications with a notebook-style interface are Mathematica and MATLAB.

Jupyter interfaces communicate with a computational backend, which is called a kernel.¹

A notebook is, on the other hand, a type of file on your disk which is the unit of development, containing both code and the output of that code. (In the case of jupyter this file format is marked by the file extension .ipynb, short for “ipython notebook”, for fraught historical reasons.) One implementation of a notebook frontend interface over a notebook protocol for jupyter is called the jupyter notebook, launched by the jupyter notebook command, which opens a javascript-backed notebook interface in a web browser. Another, more recent notebook-style interface implementation is called jupyter lab, which uses much of the same jupyter notebook infrastructure but is distinct and only sometimes interoperable, in ways I do not pretend to know in depth. And there are multiple other ‘frontends’ besides, which talk to a kernel over the jupyter notebook protocol.
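The kernel/frontend split is at least visible from the command line; jupyter kernelspec list (a real subcommand) shows the kernels jupyter knows about, independent of whichever frontend is in fashion this year:

jupyter kernelspec list
# Available kernels:
#   python3    /home/me/.local/share/jupyter/kernels/python3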

Which sense of notebook is intended, you have to work out from context; e.g. the following sentence is contentful:

Yo dawg, I heard you like notebooks, so I started up your jupyter notebook kernel in jupyter notebook

6 Jupyter as UI

See jupyter UI.

7 Front ends

See jupyter front ends.

8 Jupyter kernels

Jupyter kernels now come in (at least) two flavours. I do not know to what extent they are interchangeable. The classic flavour is ipykernel, which is shambolic but works. xeus is a newer entrant.

8.1 ipykernel

8.1.1 Custom ipykernels

jupyter looks for kernel specs in a kernel spec directory, whose location depends on the platform.

Say my kernel is dan; then the definition can be found in the following location:

  • Unixey: ~/.local/share/jupyter/kernels/dan/kernel.json
  • macOS: ~/Library/Jupyter/kernels/dan/kernel.json
  • Win: %APPDATA%\jupyter\kernels\dan\kernel.json

See the manual for details.

How to set up jupyter to use a virtualenv (or other) kernel? tl;dr Do this from inside the virtualenv to bootstrap it:

pip install ipykernel
python -m ipykernel install --user --name=my-virtualenv-name
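That writes a kernel spec pointing back at the virtualenv’s interpreter. The generated kernel.json looks roughly like the following (paths illustrative):

{
    "argv": [
        "/home/me/.virtualenvs/my-virtualenv-name/bin/python",
        "-m",
        "ipykernel_launcher",
        "-f",
        "{connection_file}"
    ],
    "display_name": "my-virtualenv-name",
    "language": "python"
}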

Addendum: for Anaconda, we can auto-install kernels for all discoverable conda envs, which worked for me, whereas the ipykernel method did not.

conda install nb_conda_kernels

8.1.2 Custom kernel lite

Sometimes I wish to run a kernel with different parameters, for example with a GPU-enabled launcher. See here for a worked example for GPU-enabled kernels:

For computers on Linux with optimus, you have to make a kernel that will be called with optirun to be able to use GPU acceleration.

I made a kernel in ~/.local/share/jupyter/kernels/dan/kernel.json and modified it thus:

{
    "display_name": "dan-gpu",
    "language": "python",
    "argv": [
        "/usr/bin/optirun",
        "--no-xorg",
        "/home/me/.virtualenvs/dan/bin/python",
        "-m",
        "ipykernel_launcher",
        "-f",
        "{connection_file}"
    ]
}

Any script the kernel invokes can be set up to load CUDA but see no actual GPU, by blanking an environment variable in a wrapper script, which is handy for kernels. So this could be in a script called noprimusrun:

#!/bin/sh
# run the given command with all GPUs hidden from CUDA
CUDA_VISIBLE_DEVICES= exec "$@"
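Alternatively, the kernel spec format itself supports an env key, so the wrapper script can be skipped entirely; a sketch (paths illustrative, and do check the jupyter_client kernel spec docs):

{
    "display_name": "dan-nogpu",
    "language": "python",
    "argv": [
        "/home/me/.virtualenvs/dan/bin/python",
        "-m",
        "ipykernel_launcher",
        "-f",
        "{connection_file}"
    ],
    "env": {"CUDA_VISIBLE_DEVICES": ""}
}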

8.2 Xeus

Xeus kernels reimplement jupyter kernels in C++. I think they are language-agnostic (well, they work with any language that can bind to C++, which is presumably everything). The python one is the most mature, AFAICT.

A new Python kernel for Jupyter.

Long story short:

  • xeus-python is a lot lighter than ipykernel, which makes it a lot easier to implement new features on top of it.
  • xeus-python already works with the Jupyter Lab debugger
  • xeus-based kernels are more versatile in that one can overload e.g. the concurrency model. This is something that Kitware’s SlicerJupyter project takes advantage of to integrate with the Qt event loop of their Qt-based desktop application.
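If you want to try it, conda-forge is the usual distribution channel; a sketch from memory, so check the xeus-python README for current instructions:

conda install -c conda-forge xeus-python
# it registers its own kernel spec, which should show up (as e.g. xpython) in
jupyter kernelspec list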

9 Jupyterlite

JupyterLite is a JupyterLab distribution that runs entirely in the browser, built from the ground up using JupyterLab components and extensions.
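As I understand the JupyterLite docs, building a static site from a folder of notebooks goes something like this (verify the flags against the current docs):

pip install jupyterlite-core
jupyter lite build --contents ./notebooks
# the result lands in _output/, which any static web host can serve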

10 Hosting static jupyter notebooks on the web

Various options. For one, GitHub will attempt to render jupyter notebooks in GitHub repos; I have had various glitches and inconsistencies with images and equations rendering in such notebooks. Perhaps it is better in…

The fastest way to share your notebooks - announcing NotebookSharing.space - Yuvi Panda

You can upload your notebook easily via the web interface at notebooksharing.space: Once uploaded, the web interface will just redirect you to the beautifully rendered notebook, and you can copy the link to the page and share it!

Or you can directly use the nbss-upload command line tool: …

When uploading, you can opt-in to have collaborative annotations enabled on your notebook via the open source, web standards-based hypothes.is service. You can thus directly annotate the notebook, instead of having to email back and forth about ‘that cell where you are importing matplotlib’ or ‘that graph with the blue border’. This is one of the coolest features of notebooksharing.space.

11 Hosting live jupyter notebooks on the web

Jupyter can host online notebooks, even multi-user notebook servers — if you are brave enough to let people execute weird code on your machine. I’m not going to go into the security implications. tl;dr encrypt and password-protect that connection. Here, see this jupyterhub tutorial.

11.1 Commercial notebook hosts

NB: This section is outdated 🏗; I should probably mention the ill-explained Kaggle kernels, Google Cloud ML execution of same, etc.

At base level, you can use a standard cloud option, buying compute time as a virtual machine or container, and run a jupyter notebook there for your choice of data science workflow.

  • Kaggle Kernels are somehow also kaggle notebooks now or something? Anyway, it seems to execute code.

  • Paperspace - Gradient Notebooks

    Gradient Notebooks is a web-based Jupyter IDE with free GPUs & IPUs.

  • sagemath runs notebooks online as part of their CoCalc service, with fancy features. There is a free tier and unspecified other pricing. Messy design but tidy open-source ideals.

  • Anaconda.org appears to be a python package development service, but they also have a sideline in hosting notebooks ($7/month). It requires you to use their Anaconda Python distribution tools, which is… a plus and a minus. The Anaconda Python distro is simple for scientific computing, but if your hard disk is as full of Python distros as mine, you tend not to want yet another one confusing things and wasting disk space.

  • Microsoft’s Azure notebooks

    Azure Notebooks is a free hosted service to develop and run Jupyter notebooks in the cloud with no installation. Jupyter (formerly IPython) is an open source project that lets you easily combine markdown text, executable code (Python, R, and F#), persistent data, graphics, and visualisations onto a single, shareable canvas called a notebook.

  • Google’s Colaboratory is hip now

    Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.

    With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.

    Here is an intro and here is another

12 Pro tips and gotchas

12.1 Meta tips

Anne Bonner’s Tips, Tricks, Hacks, and Magic: How to Effortlessly Optimise Your Jupyter Notebook is actually full of useful stuff. So much stuff that upon reading it, I nearly forget my past traumas with jupyter notebooks. If you must use jupyter, read her article and it will make stuff seem better. Many tips on this page I gleaned from her work.

12.2 Boilerplate

# prepend this magic to a cell of standard imports to save it to disk:
%%writefile basic_imports.py

# later, in another notebook, pull the file's contents back into a cell:
%load basic_imports.py

12.3 Run a notebook without the jupyter server

See jupyter command line.
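For the record, one such option is nbconvert’s execute mode, which runs a notebook headlessly and writes the executed copy back out (the filename is hypothetical):

jupyter nbconvert --to notebook --execute --inplace mynotebook.ipynb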

12.4 Offline MathJax in jupyter

e.g. for LaTeX-free mathematics rendering.

python -m IPython.external.MathJax /path/to/source/MathJax.zip

12.5 I can’t see part of the cell!

Sometimes, you can’t see the whole code cell; part of it overflows into some weird hidden alternate dimension. This is a known issue to do with vanishing scrollbars. The workaround is simple enough:

zoom out to 90% and back in to 100% (Ctrl+- then Ctrl++).

12.6 IOPub data rate exceeded

You got this error and you weren’t doing anything that bandwidth intensive? Say, you were just viewing a big image, not a zillion images? It’s jupyter being conservative in version 5.0:

jupyter notebook --generate-config
atom ~/.jupyter/jupyter_notebook_config.py

update the c.ServerApp.iopub_data_rate_limit to be big, e.g. c.ServerApp.iopub_data_rate_limit = 10000000.

This is fixed after 5.0.

13 Securing

Modern jupyter is suspicious of connections by default and will ask you either for a magic token or a password. (That is authentication, not encryption; actually encrypting the connection is a separate matter of configuring TLS or tunnelling over SSH.) Sometimes this is what I want. Not always.

When I am in HPC hell, accessing jupyter notebooks through a double SSH tunnel, the last thing I need is to put a hat on a hat by triply securing the connection. That just adds more points of failure without any additional benefit. Also, sometimes the tokens do not work over SSH tunnels for me and I cannot work out why. I think it is something about some particular jupyter version mangling tokens, or possibly failing to report that the port it wanted was already claimed by someone else (although it happens more often than is plausible for the latter case). CodingMatters notes that the following invocation will disable all jupyter-side security measures:

$ jupyter notebook --port 5000 --no-browser --ip='*' --ServerApp.token='' --ServerApp.password=''

Obviously never do this unless you believe that everyone sharing a network with that machine has your best interests at heart. The best way to ensure that is to be accessing a machine through a firewall to a locked-down port.
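For the record, the double SSH tunnel I mean looks something like this (hostnames and ports hypothetical); modern OpenSSH’s -J flag does both hops in one invocation:

# forward local port 5000 through the login node to the compute node
ssh -J me@login.hpc.example -N -L 5000:localhost:5000 me@computenode
# then browse to http://localhost:5000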

There are various other useful settings which one could use to reduce security. In config file format for ~/.jupyter/jupyter_notebook_config.py:

c.ServerApp.disable_check_xsrf = True  # XSRF checks irritated my ssh tunnel that one time
c.ServerApp.open_browser = False  # consumes a one-time token and is pointless on a headless HPC
c.ServerApp.use_redirect_file = False  # forces display of the token rather than writing it to some file that gets lost in containerisation and is useless on a headless HPC
c.ServerApp.allow_password_change = True  # allow password setup somewhere sensible
c.ServerApp.token = ''  # no auth needed
c.ServerApp.password = password  # actually needs to be hashed; see below

Eric Hodgins recommends this hack for a simple password without messing about trying to be clever with their browser infrastructure (which TBH does seem to break pretty often for me):

c = get_config()
c.ServerApp.ip = '*'
c.ServerApp.open_browser = False
c.ServerApp.port = 5000

# setting up the password
# (the passwd helper moved in the jupyter_server era; historically it was
# importable as IPython.lib.passwd)
from jupyter_server.auth import passwd
password = passwd("your_secret_password")
c.ServerApp.password = password
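If you would rather not leave the plaintext password in a config file, the hash can be generated interactively and pasted in; a one-liner, assuming a jupyter_server-era install:

python -c "from jupyter_server.auth import passwd; print(passwd())"
# prompts for the password and prints a hash to paste into c.ServerApp.password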

14 As a proxy

jupyter-server-proxy:

Jupyter Server Proxy lets you run arbitrary external processes (such as RStudio, Shiny Server, syncthing, PostgreSQL, etc) alongside your notebook, and provide authenticated web access to them.

Note

This project used to be called nbserverproxy. If you have an older version of nbserverproxy installed, remember to uninstall it before installing jupyter-server-proxy, otherwise they may conflict.

The primary use cases are:

  1. Use with JupyterHub / Binder to allow launching users into web interfaces that have nothing to do with Jupyter - such as RStudio, Shiny, or OpenRefine.
  2. Allow access from frontend javascript (in classic notebook or JupyterLab extensions) to access web APIs of other processes running locally in a safe manner. This is used by the JupyterLab extension for dask.
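Registering a process is done through server config. A minimal sketch, assuming the documented c.ServerProxy.servers interface and a hypothetical local app; leaving out a fixed port lets the proxy pick a free one and substitute it for {port}:

# in jupyter_server_config.py
c.ServerProxy.servers = {
    "myapp": {
        "command": ["/usr/local/bin/myapp", "--port", "{port}"],
    }
}
# jupyter then proxies it, behind its own authentication, at /myapp/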

15 References

Chattopadhyay, Prasad, Henley, et al. 2020. “What’s Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities.”
Granger, and Pérez. 2021. “Jupyter: Thinking and Storytelling With Code and Data.” Computing in Science & Engineering.
Himmelstein, Rubinetti, Slochower, et al. 2019. “Open Collaborative Writing with Manubot.” Edited by Dina Schneidman-Duhovny. PLOS Computational Biology.
Millman, and Pérez. 2014. “Developing Open Source Scientific Practice.”
Otasek, Morris, Bouças, et al. 2019. “Cytoscape Automation: Empowering Workflow-Based Network Analysis.” Genome Biology.
Sokol, and Flach. 2021. “You Only Write Thrice: Creating Documents, Computational Notebooks and Presentations From a Single Source.” In.

Footnotes

  1. Because in mathematics and computer science, if you don’t know what to call something, you call it a kernel. This confusing explosion of definitions is very much on-message for notebook development.