Tutorial | Code notebooks (Developer part 1)¶
Jupyter notebooks are a favorite tool of many data scientists. They provide users with an ideal environment for interactively analyzing datasets directly from a web browser, combining code, graphical output, and rich content in a single place.
Before beginning this tutorial, you may wish to review the concept article on code notebooks.
In this tutorial, you will:
Create, edit, publish, and unload Jupyter notebooks in Dataiku.
This tutorial only covers the use of Jupyter notebooks in Dataiku. You can find a separate tutorial on SQL notebooks.
The first step is to create a new Dataiku Project. You will work with a sample project containing data from the fictional Haiku T-Shirt company.
From the homepage, click +New Project > DSS tutorials > Developer > Code Notebooks.
You can also download the starter project from this website and import it as a zip file.
Given their usefulness for data science, Jupyter notebooks are natively embedded in Dataiku and tightly integrated with other components, which makes them easy to use in various ways.
Create a Jupyter notebook¶
Depending on your objectives, you can create a Jupyter notebook in Dataiku in a number of different ways. In this exercise, we will create a notebook from a dataset, which simplifies reading in the dataset of interest using the Dataiku API.
From the Flow, select the orders dataset.
In the right panel, in the Actions tab, click the Lab menu (with the microscope icon).
Under the Code Notebooks dropdown, click New.
From the notebook options, select Python.
Name the notebook orders analysis, and click Create, leaving the default option to read the dataset in memory using Pandas.
Edit and run code in a notebook¶
The newly created notebook contains some useful starter code:
The first cell uses the built-in magic commands to import the numpy and matplotlib packages.
The second cell imports other useful packages.
The third cell reads in the orders dataset and converts it to a Pandas dataframe.
The fourth cell contains a function that performs some basic analysis on the columns of the dataset.
The starter code of a notebook created from a dataset will have already read the chosen dataset into a df variable, whether that is a Pandas, R, or Scala dataframe.
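For Python, a minimal sketch of what this read pattern amounts to is shown below. Since the Dataiku API is only available inside Dataiku, the runnable part uses a small hypothetical stand-in for the orders data (the column names are illustrative assumptions, not taken from the actual dataset):

```python
# Inside Dataiku, the starter cell reads the chosen dataset through the Dataiku API:
#   import dataiku
#   dataset_orders = dataiku.Dataset("orders")
#   df = dataset_orders.get_dataframe(limit=100000)  # sampled to the first 100K rows
#
# Outside Dataiku, the equivalent is simply loading the data into a pandas
# dataframe yourself; either way, the data ends up in a variable named df.
import pandas as pd

# Hypothetical stand-in for the orders dataset (column names are illustrative)
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "tshirt_quantity": [2, 1, 5],
})

print(df.shape)
```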
You can edit the starter code as well as write your own code in the same way you would outside of Dataiku.
In this very simple exercise, we will slightly modify the existing starter code:
Remove limit=100000 from the second line of code in the third cell to remove the default dataset sampling. After removing it, the line of code should look like this:
df = dataset_orders.get_dataframe()
Add df.head() on a new line right under the one above. The code in the third cell should now look like this:
# Read the dataset as a Pandas dataframe in memory
# Note: here, we only read the first 100K rows. Other sampling options are available
dataset_orders = dataiku.Dataset("orders")
df = dataset_orders.get_dataframe()
df.head()
Run the first three cells to read in the orders dataset and display its head (by default, the first 5 rows of the dataset).
Run the fourth and last cell (pdu.audit(df)), which is part of the starter code, to display some basic information about the columns of the orders dataset.
Click the Save button (or use the shortcut Cmd + S on a Mac) to save your progress.
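The two cells you just ran can be sketched in plain pandas outside Dataiku. In the starter code, pdu typically refers to Dataiku's pandasutils module, so the per-column audit below is only a rough plain-pandas approximation of the kind of summary it prints, over a hypothetical stand-in for the orders data:

```python
import pandas as pd

# Hypothetical stand-in for the orders data (columns and values are illustrative)
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "tshirt_quantity": [2, 1, 5, 1, None, 3],
})

# head() shows the first 5 rows by default; pass a number to change that
print(df.head())     # first 5 rows
print(df.head(2))    # first 2 rows

# A rough approximation of a per-column audit: type, distinct values, missing values
audit = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "distinct": df.nunique(),
    "missing": df.isna().sum(),
})
print(audit)
```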
It’s also possible to create Jupyter notebooks from machine learning models. For more information, consult the reference documentation.
Publish a Jupyter notebook to a dashboard¶
For a collaborative platform like Dataiku, the ability to share work and analyses is essential. Dataiku allows you to save static exports (non-interactive snapshots) of Jupyter notebooks in HTML format, which can be shared on dashboards.
To share the notebook on a dashboard:
Click Publish from the Actions menu of the notebook and indicate the dashboard and slide where it should appear.
In the Tile tab of the dashboard’s Edit tab, select the Show code checkbox to display the code cells.
Save your changes, then navigate to the View tab to see how the notebook insight appears on the dashboard.
Finally, once you’re done working in a Jupyter notebook for the time being, you can free up the resources it consumes by killing its kernel. To do this:
Navigate to the Notebooks page.
Check the box to select the orders analysis notebook.
In the right panel, in the Actions tab, click Unload to kill the kernel.
Jupyter notebooks are first-class citizens in Dataiku. They are in the toolbox of most data scientists, and they make a great environment for interactively analyzing your datasets using Python, R, or Scala.
Often you’ll want to convert code notebooks into code recipes. Take that next step in this tutorial.