Architecting an Environment to Share and Collaborate with Jupyter Notebooks

Jupyter Notebooks are very useful for developing (and sharing) data analytics. In addition, their flexibility allows them to be used for much more than that – teaching materials, self-learning programming languages, and (re)publication of academic papers and ebooks are other interesting uses.

A while back, I helped architect and implement an environment that allowed data scientists to collaborate using Jupyter Notebooks. I would like to use this post to share the thought process behind the solution, as well as the other factors that were taken into consideration.

Prerequisites

  • You should have a basic understanding of Jupyter Notebooks
  • Bonus: Hands-on experience with Jupyter Notebook
  • Bonus: You know how Jupyter Notebooks are stored on disk (answer: it’s a JSON file – see the sketch after this list)
  • Pro tip: Head over to the official site (http://jupyter.org/) to find out more
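
Since the on-disk format comes up repeatedly below, here is a minimal sketch of what a notebook file actually contains – a small Python script that writes a valid one-cell notebook (nbformat 4) by hand:

```python
import json

# A minimal Jupyter Notebook as it exists on disk: plain JSON with a
# list of cells plus some format metadata.
minimal_notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,  # serialized as null; set when the cell runs
            "metadata": {},
            "outputs": [],
            "source": ["print('hello world')"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}

with open("minimal.ipynb", "w") as f:
    json.dump(minimal_notebook, f, indent=1)
```

Opening minimal.ipynb in Jupyter shows a single code cell – everything else you see in the UI is rendering layered on top of this JSON.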

Assumptions

  • This is not a “how to” post on setting up Jupyter Notebook
  • It is thought process on how I reached the final architecture
  • It will not cover Jupyter Notebook best practices – check out svds.com blog for a good post about it

Background

It helps to think of Jupyter Notebook as four logical components:

  1. Viewer – A web browser used to view/interact with Jupyter Notebooks
  2. Jupyter Notebook Application – Serves out notebooks, fronted by a web server
  3. Python Kernel – Enables code to be executed within the notebooks
  4. Data Storage – Medium used to store Jupyter Notebooks

Thought Process

Step 1: Just the Viewer

Maybe it is your first time reading about Jupyter Notebooks, and you would like to see what the hype is about. So you navigate to a notebook viewer application on the web (such as https://nbviewer.jupyter.org/) to take a look.

You enter a notebook/repository URL, and the viewer application reads the notebook JSON and renders it into nicely outlined cells, plots, and text. This is the most basic version.
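
Under the hood, this rendering step is essentially what the nbconvert library does; a minimal sketch, assuming a local file named example.ipynb:

```python
import nbformat
from nbconvert import HTMLExporter

# Read the notebook JSON and render it to static HTML -
# no kernel is involved, so no code is executed.
nb = nbformat.read("example.ipynb", as_version=4)
body, resources = HTMLExporter().from_notebook_node(nb)

with open("example.html", "w") as f:
    f.write(body)
```

Note that no Python kernel is needed at all – viewing is pure rendering, which is why component 3 is “likely not installed” below.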

  1. Viewer – Local Web Browser
  2. Jupyter Notebook Application – Remotely hosted
  3. Python Kernel – Likely not installed
  4. Data Storage – Remotely hosted


Step 2: Single User Jupyter Notebook Server (for analytics development)

You see the potential in Jupyter Notebook, and you would like to try developing your own analytics on it. You follow the documentation, install it on your laptop, start the server, and navigate to localhost:8888. You can now create notebooks and execute arbitrary Python code (e.g. print("hello world")).
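
As a quick sanity check, a first cell might look like the following – printing the interpreter version is a habit worth forming, since kernel-version drift becomes a real problem in Step 5:

```python
import sys

print("hello world")
print(sys.version)  # which Python this kernel runs - useful when comparing machines
```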

  1. Viewer – Local Web Browser
  2. Jupyter Notebook Application – Localhost
  3. Python Kernel – Localhost
  4. Data Storage – Localhost


Step 3: Decentralized Single User Jupyter Notebook Server (for analytics development) + Remote Storage

Your team director buys the idea of using Jupyter Notebook as a new way to perform data analytics. Everyone in the team installs a Single User Jupyter Notebook server on their laptop, and notebooks are passed around using external USB drives.

Just kidding, you’re more savvy than that. So you set up shared folder(s) on the team server and everyone on the same project saves their work there.
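
One way to wire this up is to point every local server at the mounted share through its configuration file; a minimal sketch, where /mnt/team-share is an assumed mount point, not a path from the original setup:

```python
# jupyter_notebook_config.py
# (create one with: jupyter notebook --generate-config)

# Serve notebooks out of the team share instead of the local home folder,
# assuming the share is already mounted at this path.
c.NotebookApp.notebook_dir = '/mnt/team-share'
```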

  1. Viewer – Local Web Browser
  2. Jupyter Notebook Application – Localhost of every machine
  3. Python Kernel – Localhost of every machine
  4. Data Storage – Remote team server


Step 4: Decentralized Single User Jupyter Notebook Server (for analytics development) + Remote Storage with Version Control

Eventually, people start writing over other people’s work, especially when they are working on the same “final” notebook version. Some kind of source control and versioning is required. Bring in the usual names: GitHub, GitLab, Bitbucket, etc.
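
One wrinkle to be aware of: because a notebook is JSON with embedded outputs (plots, tables, execution counters), raw diffs get noisy fast. A common mitigation is to strip outputs before committing; below is a minimal sketch using the nbformat library – the script and workflow are illustrative, not part of the original setup:

```python
# strip_output.py - usage: python strip_output.py notebook.ipynb
# Clears code-cell outputs in place so version-controlled diffs
# show only source changes, not re-rendered plots and counters.
import sys

import nbformat

path = sys.argv[1]
nb = nbformat.read(path, as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, path)
```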

  1. Viewer – Local Web Browser
  2. Jupyter Notebook Application – Localhost of every machine
  3. Python Kernel – Localhost of every machine
  4. Data Storage – Remote source code repository


Step 5: Centralized Single User Jupyter Notebook Public Server (for analytics development) + Remote Storage with Version Control

Diversity in development environments eventually creates the “works on my machine” collaboration problem. Different versions of the Python kernel (e.g. 2.6 vs 3.3) and/or the Jupyter Notebook application eventually break down the seamless collaboration and integration processes.

The quickest solution is then to have one centralized public server where multiple users can log in and work on their own notebooks.
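
At a minimum, this means binding the notebook server to a public interface and protecting it with a (shared) password; a sketch of the relevant settings, with placeholder values:

```python
# jupyter_notebook_config.py on the centralized server
c.NotebookApp.ip = '0.0.0.0'        # listen on all interfaces, not just localhost
c.NotebookApp.port = 8888
c.NotebookApp.open_browser = False  # it is a server; nobody is sitting at it

# Hashed password - generate the hash beforehand with notebook.auth.passwd()
c.NotebookApp.password = 'sha1:<hash from notebook.auth.passwd()>'
```

Note that a single shared login is precisely why there is no permissions control, as the disadvantage below points out.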

  1. Viewer – Local Web Browser
  2. Jupyter Notebook Application – Remote on centralized server
  3. Python Kernel – Remote on centralized server
  4. Data Storage – Remote source code repository

Advantage: Thin clients (only a web browser is needed), and faster onboarding (less complexity)

Disadvantage: All the notebooks are visible to everyone (no permissions control), and new Python packages cannot be added/removed without sysadmin help and/or the risk of breaking something

The impulsive solution would then be to have multiple public servers, but the data scientists would have to remember many passwords and juggle between servers. And it still does not solve the sysadmin part, unless the data scientists dual-hat their roles.


Step 6: Decentralized Single User Jupyter Notebook (for analytics development) + Centralized Single User Jupyter Notebook Public Server (for “production” notebooks) + Remote Storage with Version Control

We need the decentralized flexibility for data scientists to develop their own analytics without being hindered by their tools (Step 4), but also a “production” environment for collaboration and integration (Step 5). A hybrid architecture comprising both steps is born.

Data scientists develop, and collaborate on, analytics on their local machines. When a notebook is “finished”, the team leader uploads the production version onto the centralized public server for everyone to view (e.g. internal stakeholders, other teams).

With regard to data storage, data scientists use the source code repository to manage their collaborations (e.g. branch/fork/merge notebooks), while the centralized public server uses it as a remote backup.

External team processes will be required to standardize the analytics packages and software versions used.
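
One lightweight way to back such a process is a version check at the top of every “production” notebook, so drift is caught at upload time rather than in front of stakeholders. A sketch – the pinned versions and the choice of pandas/numpy are purely illustrative:

```python
import sys

import numpy
import pandas

# Versions agreed by the team - illustrative values, not real pins
EXPECTED = {"python": "3.6", "numpy": "1.13", "pandas": "0.20"}

assert sys.version.startswith(EXPECTED["python"]), sys.version
assert numpy.__version__.startswith(EXPECTED["numpy"]), numpy.__version__
assert pandas.__version__.startswith(EXPECTED["pandas"]), pandas.__version__
```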

  1. Viewer – Local Web Browser
  2. Jupyter Notebook Application – Localhost of every machine, and one centralized team public server
  3. Python Kernel – Localhost of every machine, and one on centralized team public server
  4. Data Storage – Remote source code repository


Step 7: Multi-User Jupyter Notebook Server + Remote Storage with Optional Version Control

External processes may be challenging to manage, but they can be enforced by a system with the appropriate settings. A multi-user Jupyter Notebook application/platform would do just that – and that is where JupyterHub fills the gap.

JupyterHub provides a centralized multi-user environment that allows concurrent users to develop Jupyter Notebooks. It dynamically spins up/down Single User Jupyter Notebook servers for each connecting user, granting each of them a private notebook server instance with their existing notebooks.

Data scientists will use their own accounts for analytics development, and a special account will be set up to store production version notebooks.
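
A sketch of the relevant JupyterHub configuration, using option names from the 0.8-era documentation – the user names are placeholders, and “produser” stands in for the special production account:

```python
# jupyterhub_config.py (option names circa JupyterHub 0.8)
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.port = 8000

# Who may log in (PAM authentication against system accounts by default);
# 'produser' is the special account that holds production notebooks.
c.Authenticator.whitelist = {'alice', 'bob', 'produser'}
c.Authenticator.admin_users = {'teamlead'}

# Root each user's spawned single-user server in their own home folder
c.Spawner.notebook_dir = '~/notebooks'
```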

With regard to data storage, each user’s home folder can be synced to the remote repository by a daily cron job (e.g. a nightly crontab entry that runs git add/commit/push in that folder) as a backup. If version control is not critical, it can be omitted.

At the time of writing, JupyterHub is still in development, at version 0.8.

Advantage: Thin clients (with the centralized model), and ease of integration (all package/software versions are the same)

Disadvantage: New Python packages cannot be added/removed without sysadmin help and/or the risk of breaking something (but this may not be relevant to data scientists), and there is added complexity to set up and manage the platform

  1. Viewer – Local Web Browser
  2. JupyterHub Application – Remotely hosted
  3. Python Kernel – Remotely hosted
  4. Data Storage – Remote storage (e.g. cloud), or source code repository


Summary

I hope this information will be helpful to someone down the road.

I admit that this may not be the most effective way to implement it. Feel free to share your thoughts and considerations as well 🙂


2 thoughts on “Architecting an Environment to Share and Collaborate with Jupyter Notebooks”

  1. Hey Luppeng! This is a great write up. I really like the thought process between transitions.
    Do you have any ideas of how to implement the final step? I’m really interested in finding out what the architecture for that system may look like.


    1. Hi Nayana Anil – good to hear from you, and I am really glad that you enjoyed the write up.

      Regarding the final step, I assume you are referring to Step 7 with JupyterHub? Its documentation (https://jupyterhub.readthedocs.io/en/latest/getting-started.html) has a really cool diagram showing the subsystems that make up its architecture. The “Quickstart – Installation” section contains the necessary steps to help get it installed on your machine.

      An alternative I considered previously was to run individual Jupyter Notebook instances in a multi-tenancy PaaS environment (e.g. VMs on an enterprise server/cloud, Docker containers). While this method allows for fine-grained control, additional technical skills are needed to manage, operate, and administer the PaaS.

