Concept: Code Notebooks in Dataiku
A code notebook combines executable code with rich text, so you can write code interactively, perform exploratory analysis, or develop and present code projects.
A notebook keeps code and its output in a single document, alongside narrative text, visualizations, and other rich media. In one document, you can run code, explain it, and display the results, which makes your work more transparent.
Dataiku features the following code notebooks for exploratory or experimental work:
SQL notebooks for exploring and manipulating tabular data on SQL databases, which can also be used to perform queries on Impala, Hive and Spark SQL.
Jupyter notebooks to run Python, R or Scala code, which can also be executed in Spark.
An SQL notebook is attached to a single SQL connection configured in Dataiku and allows you to perform queries in an interactive environment.
To use SQL notebooks in Dataiku, you need an available SQL connection.
An SQL notebook uses SQL queries to interact with the tables behind the DSS SQL datasets that use this connection, or directly with any other table made available through the connection.
Use Cases of SQL Notebooks
SQL notebooks allow you to execute many kinds of SQL statements, from the simplest SQL queries to advanced DDL statements, stored procedures, and more.
In particular, they let you:
Quickly prototype an analysis over an SQL dataset and return query outputs without having to write them as new datasets in your SQL database.
Leverage an SQL engine for data analysis.
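The first use case, prototyping a query and inspecting its output without persisting it, can be sketched outside Dataiku. The example below is only an illustration: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a configured SQL connection, and the orders table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for a configured SQL connection (assumption).
conn = sqlite3.connect(":memory:")

# DDL and sample rows for a hypothetical "orders" table.
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 12.5)],
)

# Exploratory aggregate: the result is fetched for inspection only,
# and no new table is created to hold it.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY customer
"""
totals = conn.execute(query).fetchall()
print(totals)

# Only the source table exists; the query output was never persisted.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(tables)
conn.close()
```

In an SQL notebook, the SELECT statement alone would go in a query cell, and the results would be displayed under the cell.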
Create an SQL Notebook
There are two main ways to create an SQL notebook:
from the Notebooks menu, by clicking the + New Notebook button; or
from the Lab menu of an SQL dataset in the Flow, by clicking New in the Code Notebooks section.
Use an SQL Notebook
An SQL notebook is made of one or more query cells. In each cell, you modify and execute a single query, and then view its results.
The recommended way to work with an SQL notebook is to keep a single cell for each query that you might want to rerun later.
Each cell has its own version history so you can safely work on tuning and debugging your queries, as you’ll always be able to revert to previously executed states.
Jupyter notebooks in Dataiku allow you to write and evaluate Python, R, or Scala code interactively.
The screenshots in this article use Python notebooks as an example, but the process is similar for R and Scala notebooks as well.
Create a Jupyter Notebook
Like SQL notebooks, Jupyter notebooks can be created:
from the Notebooks menu; or
from the Lab menu of a dataset.
In both cases, Dataiku prompts you to enter a name for the notebook, select a code environment, and choose a starter template depending on your use case. For example, for Python notebooks, Dataiku provides starter templates for reading a dataset into memory and for using PySpark.
The starter code in Python notebooks generally contains a few import statements for the dataiku and pandas packages, as well as sample code for loading a DSS dataset as a dataframe.
If you create a notebook from the Lab menu of a dataset, the starter code recognizes and loads this specific dataset as a dataframe.
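The dataset-loading part of the starter code typically resembles the sketch below. The exact template generated by Dataiku may differ. Here the logic is wrapped in a function, and the function name and the dataset name used in the usage comment are hypothetical.

```python
def load_dataset_as_dataframe(dataset_name, limit=None):
    """Load a Dataiku dataset into a pandas DataFrame.

    The import is local so that this code can be imported outside Dataiku,
    where the dataiku package is not installed.
    """
    import dataiku  # provided by the Dataiku code environment

    dataset = dataiku.Dataset(dataset_name)
    # limit=None reads every row; pass an integer to read only the first rows.
    return dataset.get_dataframe(limit=limit)


# Inside a Dataiku notebook (hypothetical dataset name):
# df = load_dataset_as_dataframe("my_dataset", limit=1000)
# df.head()
```

Reading a limited number of rows first can keep exploration fast on large datasets, since the whole dataframe is held in the kernel's memory.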
Once your notebook is created, you can write your code in the same way that you would if you were using Jupyter notebooks outside of Dataiku. In addition, Dataiku provides multiple code samples to help you get started.
To manage multiple Jupyter notebooks across projects, keep the notion of kernels in mind. A Jupyter kernel is a process that is created each time a user opens a Jupyter notebook and that holds the notebook’s computational state.
The user can also change the kernel, which is equivalent to changing the code environment that the notebook is using.
You can see the name of the kernel (or code environment) that is currently being used in the upper right corner of the notebook’s navigation bar.
When the user navigates away from the notebook, the kernel remains alive, so long-running computations continue without the notebook having to stay open. If left unchecked, however, idle kernels keep holding memory and other compute resources.
This is why it’s important to stop kernels that no longer need to stay alive.
Users can stop their kernels by selecting a Jupyter notebook from the Notebooks menu and clicking the Unload button. Unloading destroys the process running the code, along with all of its state, but preserves the code in the notebook.
Administrators can list and stop the kernel of any Jupyter notebook in Dataiku from the Administration menu, by navigating to the Monitoring section and then opening Background tasks.
Administrators can also choose to automatically kill Jupyter kernels after a certain number of days using a Macro.
Both administrators and users with appropriate access can also kill running notebooks programmatically, using the public Dataiku API.
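As a rough sketch of the programmatic route, the function below stops every running Jupyter session in one project. It assumes the project handle comes from the dataikuapi client and exposes list_jupyter_notebooks, get_sessions, and unload as in recent versions of the public API, and that each session is a dictionary with a sessionId key. Check these names against the API reference for your Dataiku version. The host, API key, and project key in the usage comment are placeholders.

```python
def unload_active_notebooks(project):
    """Stop every running Jupyter session in a project.

    `project` is expected to behave like a dataikuapi project handle
    (assumption: method names as in recent public API versions).
    Returns the list of session ids that were stopped.
    """
    stopped = []
    for notebook in project.list_jupyter_notebooks(active=True, as_type="object"):
        # A notebook may have several sessions, or none at all.
        for session in notebook.get_sessions() or []:
            session_id = session["sessionId"]
            notebook.unload(session_id=session_id)
            stopped.append(session_id)
    return stopped


# Usage (placeholder host, API key, and project key):
# import dataikuapi
# client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
# unload_active_notebooks(client.get_project("MYPROJECT"))
```

Unloading each session explicitly avoids ambiguity when a notebook has more than one session open.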
Deploy a Jupyter Notebook as a Code Recipe
Code notebooks allow users to interactively work with code in the exploratory stage. However, once the code is ready to be deployed to production, it needs to be easily convertible from a notebook to a script.
Dataiku makes this easy by allowing you to deploy a Jupyter notebook as a code recipe.
You can also open and edit code recipes as notebooks, experiment with and test your changes, and then deploy them back to the recipe.
All of this makes for a smooth two-way navigation between experimental and production work.
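To illustrate what changes between the two modes, a Python recipe typically reads its input datasets and writes its output datasets through the dataiku package, whereas a notebook only explores data. The sketch below is illustrative and not the code Dataiku generates: the function names, the dataset names, and the dropna transformation are assumptions, and the logic is wrapped in functions for clarity.

```python
def clean(df):
    """Transformation developed interactively in the notebook (illustrative)."""
    return df.dropna()


def run_recipe(input_name, output_name):
    """Read the input dataset, transform it, and write the output dataset."""
    import dataiku  # provided by the Dataiku code environment

    df = dataiku.Dataset(input_name).get_dataframe()
    result = clean(df)
    # write_with_schema sets the output schema from the DataFrame, then writes the rows.
    dataiku.Dataset(output_name).write_with_schema(result)
    return result


# Inside a recipe (hypothetical dataset names):
# run_recipe("orders", "orders_clean")
```

Keeping the transformation in its own function makes it easier to move the same logic between a notebook and a recipe.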