Hands-On Tutorial: Code Notebooks¶
Jupyter notebooks are a favorite tool of many data scientists. They provide users with an ideal environment for interactively analyzing datasets directly from a web browser, combining code, graphical output, and rich content in a single place.
In this tutorial, you will learn how to create, edit, and publish Jupyter notebooks in Dataiku.
This hands-on tutorial covers only the use of Jupyter notebooks in Dataiku. To learn about using SQL notebooks, read this tutorial.
Let’s Get Started!¶
You will work with a sample project containing data from the fictional Haiku T-Shirt company.
To complete this tutorial, you should have:
Some familiarity with the basics of Dataiku (we recommend having completed the Basics courses).
Some familiarity with coding in Python and using Jupyter notebooks.
Create Your Project¶
The first step is to create a new Dataiku Project.
From the homepage, click +New Project > DSS Tutorials > Developer > Code Notebooks (Tutorial).
You can also download the starter project from this website and import it as a zip file.
Create and Use Jupyter Notebooks¶
Given their usefulness for doing data science, Jupyter notebooks are natively embedded in Dataiku, and tightly integrated with other components, which makes them easy to use in various ways.
Create a Jupyter Notebook¶
Depending on your objectives, you can create a Jupyter notebook in Dataiku in a number of different ways. In this exercise, we will create a notebook from a dataset, which simplifies reading in the dataset of interest using the Dataiku DSS API.
From the Flow, select the orders dataset.
In the right sidebar, locate and click the Lab menu (with the microscope icon).
Under “Code Notebooks”, click New.
From the notebook options, select Python.
Name the notebook “orders analysis” and click Create, leaving the default option to read the dataset in memory using Pandas.
Edit and Run Code in a Notebook¶
The newly created notebook contains some useful starter code:
The first cell uses the built-in magic commands to import the numpy and matplotlib packages.
The second cell imports other useful packages.
The third cell reads in the orders dataset and converts it to a Pandas dataframe.
The fourth cell contains a function that performs some basic analysis on the columns of the dataset.
The starter code of a notebook created from a dataset will have already read the chosen dataset into a df variable, whether it is a Pandas, R, or Scala dataframe.
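For orientation, the read step of the starter code can be approximated as a single helper. This is a sketch under stated assumptions, not the exact generated cells: the `load_orders` name and the fallback sample rows (including their column names) are hypothetical, and the fallback exists only so the snippet also runs outside Dataiku, where the `dataiku` package is not importable. The `dataiku.Dataset("orders")` and `get_dataframe(limit=...)` calls are the ones the starter code itself uses.

```python
import pandas as pd


def load_orders(limit=None):
    """Return the orders dataset as a Pandas dataframe.

    Inside Dataiku this reads the real dataset. Elsewhere it falls back to a
    tiny made-up sample (hypothetical columns) so the sketch stays runnable.
    """
    try:
        import dataiku  # only importable inside a Dataiku environment
    except ImportError:
        sample = pd.DataFrame({
            "order_id": [1, 2, 3],          # hypothetical column
            "tshirt_quantity": [2, 1, 4],   # hypothetical column
        })
        return sample if limit is None else sample.head(limit)

    dataset_orders = dataiku.Dataset("orders")
    if limit is None:
        # No limit: read every row of the dataset into memory.
        return dataset_orders.get_dataframe()
    # With a limit: read only the first `limit` rows.
    return dataset_orders.get_dataframe(limit=limit)


df = load_orders(limit=2)
print(df.shape)
```

Passing `limit=None` mirrors reading the dataset without the default sampling, while a numeric `limit` mirrors the sampled read that the generated code performs.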
You can edit the starter code as well as write your own code in the same way you would outside of Dataiku.
In this very simple exercise, we will slightly modify the existing starter code:
Remove limit=100000 from the second line of code in the third cell to remove the default dataset sampling.
After removing it, the line of code should look like this:
df = dataset_orders.get_dataframe()
Add the following line of code right under the one above:
df.head()
The code in the third cell should now look like this:
# Read the dataset as a Pandas dataframe in memory
# Note: with limit=100000 removed, all rows are read. Other sampling options are available
dataset_orders = dataiku.Dataset("orders")
df = dataset_orders.get_dataframe()
df.head()
Run the first three cells to read in the orders dataset and display the “head”, or the first few rows of the dataset.
Run the fourth and last cell (pdu.audit(df)), which is part of the starter code, to display some basic information about the columns of the orders dataset.
Click the Save button (or use the keyboard shortcut) to save your progress.
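To get a feel for the kind of per-column summary the audit cell displays, here is a rough equivalent written with plain Pandas. It is an approximation for illustration only, not the implementation of `pdu.audit`: the `audit_columns` helper, the chosen summary fields, and the sample data are all assumptions.

```python
import pandas as pd


def audit_columns(df):
    """Summarize each column: dtype, null count, and distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isnull().sum(),
        "unique": df.nunique(),  # distinct non-null values per column
    })


# Hypothetical sample standing in for the orders dataset.
sample = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "tshirt_category": ["White", "Black", None, "White"],
})
summary = audit_columns(sample)
print(summary)
```

Each row of the resulting frame describes one column of the input, which makes it easy to spot columns with missing values or unexpectedly few distinct values before going further.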
It’s also possible to create Jupyter notebooks from machine learning models. For more information, consult the product documentation.
Publish a Jupyter Notebook to a Dashboard¶
For a collaborative platform like Dataiku, the ability to share work and analyses is of high importance. Dataiku allows you to save static exports (non-interactive snapshots) of Jupyter notebooks in an HTML format, which can be shared on dashboards.
To share the notebook on a dashboard:
Click Publish from the Actions menu of the notebook and indicate the dashboard and slide where it should appear.
By default, only the printed outputs of the notebook appear in the published insight.
In the right sidebar of the dashboard’s Edit tab, select the Show code checkbox to display the code cells.
Save your changes, then navigate to the View tab to see how the notebook insight appears on the dashboard.
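The difference between the default view and the Show code view can be illustrated with a small standalone sketch. It assumes the standard Jupyter notebook JSON layout (nbformat 4) and uses only the Python standard library. The `render_snapshot` function and the sample notebook are hypothetical, and this is not how Dataiku builds its exports; it only shows the idea of a static snapshot that includes outputs by default and code cells on request.

```python
import html


def render_snapshot(notebook, show_code=False):
    """Render a notebook dict (nbformat 4 layout) as a static HTML string."""
    parts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        if show_code:
            # "source" may be a string or a list of strings; join handles both.
            source = "".join(cell.get("source", []))
            parts.append('<pre class="code">' + html.escape(source) + "</pre>")
        for output in cell.get("outputs", []):
            if output.get("output_type") == "stream":
                text = "".join(output.get("text", []))
            else:
                # execute_result / display_data keep text under data["text/plain"].
                text = "".join(output.get("data", {}).get("text/plain", []))
            if text:
                parts.append('<pre class="output">' + html.escape(text) + "</pre>")
    return "<html><body>" + "\n".join(parts) + "</body></html>"


# Hypothetical one-cell notebook with a printed output.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        }
    ]
}

default_view = render_snapshot(notebook)               # outputs only
with_code = render_snapshot(notebook, show_code=True)  # code and outputs
```

Because the result is a plain HTML string, it is a non-interactive snapshot: viewers see what the notebook produced when it was exported, not a live kernel.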
Unload a Notebook¶
Finally, once you’re done working in a Jupyter notebook for the time being, you can free the resources it is using by unloading it, which kills its kernel. To do this:
Navigate to the Notebooks page, select the “orders analysis” notebook, and click Unload in the right sidebar.
Jupyter notebooks are first-class citizens in Dataiku. They are in the toolbox of most data scientists, and they make a great environment for interactively analyzing your datasets using Python, R, or Scala.
To learn more about notebooks in Dataiku, you can: