Concept | Code notebooks#

A code notebook is a tool that allows users to combine executable code and rich text for interactively writing code, performing exploratory analysis, or developing and presenting code projects.

A notebook integrates code and its output into a single document, alongside narrative text, visualizations, and other rich media. In other words, within one document you can run code, display its output, and add explanations, which makes your work more transparent.

Dataiku features the following code notebooks for exploratory or experimental work:

SQL notebooks for exploring and manipulating tabular data on SQL databases, which can also be used to perform queries on Impala, Hive, and Spark SQL.
Jupyter notebooks to run Python, R, or Scala code, which can also be executed in Spark.

SQL notebooks#

Each SQL notebook is attached to a single SQL connection configured in Dataiku and allows you to perform queries in an interactive environment.

Note

To use SQL notebooks in Dataiku, you need to have an available SQL connection.

SQL notebooks run SQL queries either against the tables behind the Dataiku SQL datasets that use this connection, or directly against any other table that the connection makes available.

Use cases of SQL notebooks#

SQL notebooks allow you to execute many kinds of SQL statements, from the simplest SQL queries to advanced DDL statements, stored procedures, and more.

In particular, you can use them to:

Prototype an analysis over an SQL dataset and return query outputs without having to write them as new datasets in your SQL database.
Leverage an SQL engine for data analysis.
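The first use case above can be sketched as follows. This is only an illustration: Python's built-in sqlite3 module stands in for a Dataiku SQL connection, and the orders table and its columns are hypothetical. It shows an exploratory aggregate query whose output is returned to the caller without being written back as a new table.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the SQL connection.
conn = sqlite3.connect(":memory:")

# Hypothetical table playing the role of an existing SQL dataset.
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 14.5)],
)

# The kind of exploratory query you might prototype in a single cell.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()

# The result lives only in memory; the database still holds one table.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

print(rows)
print(tables)
```

In an SQL notebook, only the SELECT statement would go in the cell; the setup code here exists purely to make the sketch self-contained.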

Creating SQL notebooks#

You can create an SQL notebook from either of two places:

The Notebooks menu, by clicking the + New Notebook button.
The Lab menu of an SQL dataset in the Flow, by clicking New in the Code Notebooks section.

Using SQL notebooks#

SQL notebooks consist of one or more query cells. In each cell, you modify and execute a single query, and then view its results.

Try using a single cell for each query that you might want to rerun later.

Each cell has its own version history, so you can tune and debug your queries freely and always revert to a previously executed state.

Jupyter notebooks#

Jupyter notebooks in Dataiku allow you to write and evaluate Python, R, or Scala code interactively.

Note

The screenshots in this article use Python notebooks as an example, but the process is similar for R and Scala notebooks as well.

Creating Jupyter notebooks#

Similar to SQL notebooks, you can create Jupyter notebooks from:

The Notebooks menu.
The Lab menu of a dataset.

In both cases, Dataiku prompts you to enter a name for the notebook, select a code environment, and choose a starter template depending on your use case. For example, for Python notebooks, Dataiku provides starter templates for reading a dataset in memory and using PySpark.

The starter code in Python notebooks generally contains a few import statements for the dataiku and pandas packages, as well as sample code for loading a Dataiku dataset as a dataframe.

If you create a notebook from the Lab menu of a dataset, the starter code recognizes and loads this specific dataset as a dataframe.
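As a rough idea of what such starter code does, here is a minimal sketch. It assumes the dataiku package that is available inside Dataiku notebooks; the helper name load_dataset_as_dataframe and the dataset name my_dataset are hypothetical, and the actual generated template may differ.

```python
def load_dataset_as_dataframe(dataset_name):
    """Load a Dataiku dataset into memory as a pandas DataFrame.

    `dataset_name` is the name of a dataset in the current project.
    """
    # Imported inside the function so the sketch can be defined anywhere;
    # in a real notebook this import normally sits at the top of the first cell.
    import dataiku

    dataset = dataiku.Dataset(dataset_name)
    # get_dataframe() reads the whole dataset into memory.
    return dataset.get_dataframe()
```

In a notebook cell you would then call something like df = load_dataset_as_dataframe("my_dataset") and inspect the result with df.head().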

Once you have created a notebook, you can write your code in the same way that you would if you were using Jupyter notebooks outside of Dataiku. In addition, Dataiku provides multiple code samples to help you get started.

Jupyter kernels#

To successfully manage multiple Jupyter notebooks across projects, it’s important to keep the notion of kernels in mind. A Jupyter kernel is a process that is created each time a user opens a Jupyter notebook, and that holds the computational state of that notebook.

The user can also change the kernel, which is equivalent to changing the code environment that the notebook is using.

Note

You can see the name of the kernel (or code environment) that’s currently in use in the upper right corner of the notebook’s navigation bar.

When the user navigates away from the notebook, the kernel remains alive, so long-running computations continue without the notebook having to stay open. If left unchecked, however, idle kernels keep holding memory and other resources on the server.

This is why it’s important to stop kernels when it’s not necessary to keep them alive.

Users can stop their kernels by selecting a Jupyter notebook from the Notebooks menu and clicking the Unload button. This destroys the process running the code and all its state, but it preserves the code itself in the notebook.

Administrators can list and stop the kernel of any Jupyter notebook in Dataiku from the Administration menu, by navigating to the Monitoring section and then opening Background tasks.

Administrators can also choose to automatically kill Jupyter kernels after a certain number of days using a macro.

Both administrators and users with appropriate access can also kill running notebooks programmatically, using the public Dataiku API.
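A minimal sketch of the programmatic route, assuming the dataikuapi Python client. The method names list_jupyter_notebooks (with active and as_type arguments) and unload are given as recalled from that client and should be checked against the API reference for your Dataiku version; the helper name unload_active_notebooks is hypothetical.

```python
def unload_active_notebooks(project):
    """Stop the kernels of all currently running Jupyter notebooks in a project.

    `project` is assumed to be a project handle from the dataikuapi client,
    for example the result of client.get_project("MY_PROJECT").
    Returns the number of notebooks that were unloaded.
    """
    unloaded = 0
    # active=True restricts the listing to notebooks with a running kernel
    # (assumed signature, see the note above).
    for notebook in project.list_jupyter_notebooks(active=True, as_type="object"):
        # Unloading destroys the kernel process and its state,
        # but keeps the code stored in the notebook.
        notebook.unload()
        unloaded += 1
    return unloaded
```

To use it, you would first build a client with dataikuapi.DSSClient(host, api_key), where the host URL and API key are specific to your instance, and then pass client.get_project(...) to the helper.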

Deploying a Jupyter notebook as a code recipe#

Code notebooks allow users to work with code interactively during the exploratory stage. However, once the code is ready for production, it needs to be converted from a notebook into a script.

Dataiku makes this easy by allowing you to deploy a Jupyter notebook as a code recipe.

You can also open and edit code recipes as notebooks, experiment and test your changes, and then deploy them back to the recipe.

All this makes for a smooth two-way navigation between experimental and production work.
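To give a feel for the shape of code that moves from a notebook into a recipe, here is a minimal sketch. It assumes the dataiku package available inside Dataiku; the function name run_recipe, its parameters, and the dataset names in the usage note are hypothetical, and the recipe that Dataiku actually generates may be organized differently.

```python
def run_recipe(input_name, output_name, transform):
    """Read an input dataset, apply a transformation, and write the result.

    `transform` is any function that takes the input dataframe and returns
    the dataframe to be written to the output dataset.
    """
    # Imported here so the sketch can be defined outside Dataiku as well.
    import dataiku

    # Exploratory step from the notebook: load the data in memory.
    df = dataiku.Dataset(input_name).get_dataframe()

    # The logic that was prototyped interactively in notebook cells.
    result = transform(df)

    # Production step added by the recipe: persist the output dataset.
    dataiku.Dataset(output_name).write_with_schema(result)
    return result
```

For example, run_recipe("orders", "orders_clean", lambda df: df.dropna()) would write a copy of a hypothetical orders dataset with incomplete rows removed.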

Next steps#

This article presented the two main types of code notebooks in Dataiku: SQL and Jupyter notebooks.

To learn more, you can follow tutorials on: