Tutorial | Python basics with Dataiku#

Dataiku allows you to seamlessly integrate Python code and visual recipes in a Flow.

In this tutorial, we will show you how to:

  • Integrate Python as part of your data pipeline through code recipes.

  • Use Jupyter notebooks to prototype and test code.

  • Transfer a Dataiku dataset into a pandas dataframe and back, using the dataiku Python package.

We will work with the fictional retailer Haiku T-Shirt’s data.

Prerequisites#

This tutorial assumes that you are familiar with the Dataiku Basics courses.

Create your project#

The first step is to create a new Dataiku Project.

  1. From the homepage, click + New Project > DSS tutorials > Developer > Code Notebooks.

Note

You can also download the starter project from this website and import it as a zip file.

../../_images/tshirt-python-flow-011.png

In the Flow, you can see that the Haiku T-Shirt orders and customers_stacked data have been uploaded into Dataiku. In addition, the customers_stacked data has already been prepared with a visual Prepare recipe.

Your first Python recipe#

Our current goal is to group past orders by customer, aggregating their past interactions. In the Basics courses, we accomplished this with a visual Group recipe, but it can also be easily accomplished with Python code.

  1. With the orders dataset selected, choose Actions > Code Recipes > Python.

  2. Add a new output dataset named orders_by_customer.

  3. Click Create Recipe.

The recipe editor is now populated with the following code, which:

  • Reads the orders dataset into a pandas dataframe named orders_df.

  • Passes it unchanged to a new dataframe named orders_by_customer_df.

  • Writes that new dataframe out to the orders_by_customer dataset.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
orders = dataiku.Dataset("orders")
orders_df = orders.get_dataframe()

# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe
# NB: Dataiku also supports other kinds of APIs for reading and writing data. Please see doc.

orders_by_customer_df = orders_df # For this sample code, simply copy input to output

# Write recipe outputs
orders_by_customer = dataiku.Dataset("orders_by_customer")
orders_by_customer.write_with_schema(orders_by_customer_df)

As the commented TODO says, we’ll need to provide the code that aggregates the orders by customer. Dataiku provides a number of code samples to help get us started.

../../_images/tshirt-python-recipe-codesamples-01.png
  1. In this case, replace the line that defines orders_by_customer_df with the following code.

    orders_by_customer_df = (
        orders_df
        .assign(total=orders_df.tshirt_price * orders_df.tshirt_quantity)
        .groupby(by="customer_id")
        .agg({"pages_visited": "mean", "total": "sum"})
    )
    

    This creates a dataframe with one row per customer_id. For each customer, we compute the average number of pages visited on the Haiku T-shirt website during orders, and the total value of that customer's orders, where the value of each order is the t-shirt price times the number of t-shirts purchased.

  2. Run the recipe, and when it completes, explore the output dataset.

    Interestingly, the output dataset does not contain the customer_id column, but we’ll need it later in order to join the customers and grouped orders data.

    ../../_images/tshirt-python-orders_by_customer-01.png
  3. To diagnose the issue, in the orders_by_customer dataset click Parent Recipe to reopen the Python recipe.

  4. Click Edit in Notebook.

This opens a Jupyter notebook with the recipe code, where we can interactively test it. The recipe code begins in a single cell.

  1. Split the cell so that the code to write recipe outputs is in a separate cell.

  2. Add a cell between the two existing cells and put the following code in it.

    orders_by_customer_df.head()
    
  3. Run the first two cells.

    The output shows that the orders_by_customer dataframe does retain the customer_id information; however, groupby has made customer_id the dataframe's index rather than a regular column.

    ../../_images/tshirt-python-orders-notebook-01.png
  4. To move customer_id back into a regular column, add .reset_index() to the code that defines the dataframe so that it looks like the following.

    orders_by_customer_df = (
        orders_df
        .assign(total=orders_df.tshirt_price * orders_df.tshirt_quantity)
        .groupby(by="customer_id")
        .agg({"pages_visited": "mean", "total": "sum"})
        .reset_index()
    )
    
  5. Re-run the first two cells to see how the dataframe has changed.

  6. Click Save back to recipe and run the recipe again.

Now, the output dataset contains a customer_id column.

../../_images/tshirt-python-orders_by_customer-02.png
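The index behavior we just fixed can be reproduced with plain pandas. The following is a minimal sketch using made-up values, not the tutorial's real data:

```python
import pandas as pd

# Toy orders data with the same columns the recipe uses
orders_df = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "pages_visited": [4, 6, 3],
    "tshirt_price": [10.0, 20.0, 15.0],
    "tshirt_quantity": [1, 2, 1],
})

grouped = (orders_df
           .assign(total=orders_df.tshirt_price * orders_df.tshirt_quantity)
           .groupby(by="customer_id")
           .agg({"pages_visited": "mean", "total": "sum"}))
# customer_id is now the dataframe's index, not one of its columns
print("customer_id" in grouped.columns)   # False

flat = grouped.reset_index()
# reset_index() turns the index back into a regular column
print(list(flat.columns))   # ['customer_id', 'pages_visited', 'total']
```

This is why the first run of the recipe produced no customer_id column: write_with_schema writes the dataframe's columns, and the grouping key was sitting in the index.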

Explore with a Python notebook#

Previously, we started with a Python recipe because we had a specific goal of transforming the orders dataset. Alternatively, we can explore a dataset using a notebook.

Create a Python notebook#

  1. In the Flow, select the customers_stacked_prepared dataset.

  2. From the Actions tab, click Lab > New Code Notebook > Python.

  3. Read the dataset in memory using Pandas.

  4. Click Create.

    ../../_images/tshirt-python-customers-create-notebook-01.png

    The notebook is automatically populated with four cells.

    ../../_images/tshirt-python-customers-notebook-01.png

    The first two cells import various useful packages. The first cell uses the %pylab built-in magic, which imports the numpy and matplotlib packages into the notebook's namespace. (Recent versions of IPython deprecate %pylab in favor of %matplotlib inline plus explicit imports.)

    %pylab inline
    
    import dataiku
    from dataiku import pandasutils as pdu
    import pandas as pd
    

    The third cell reads the customers_stacked_prepared dataset into a pandas dataframe named df.

    # Read the dataset as a Pandas dataframe in memory
    # Note: here, we only read the first 100K rows. Other sampling options are available
    
    dataset_customers_stacked_prepared = dataiku.Dataset("customers_stacked_prepared")
    df = dataset_customers_stacked_prepared.get_dataframe(limit=100000)
    

    The fourth cell audits the columns in the dataframe to provide information on the type of data in the column and the number of unique and missing values.

    # Get some simple descriptive statistics
    pdu.audit(df)
    
  5. Run each of the cells in order to produce the output.

    ../../_images/tshirt-python-customers-notebook-02.png

Note

The notebook also now has the dataframe df ready in memory.
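Outside Dataiku, a rough pandas-only equivalent of the audit above can be assembled from built-ins. This is only a sketch with toy column names, and its output format differs from pdu.audit's:

```python
import pandas as pd

# Toy stand-in for the customers dataframe
df = pd.DataFrame({
    "customerID": ["c1", "c2", None],
    "campaign": [True, False, True],
})

# One row per column: dtype, number of distinct values, missing-value count
summary = pd.DataFrame({
    "type": df.dtypes.astype(str),
    "unique": df.nunique(),
    "missing": df.isna().sum(),
})
print(summary)
```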

Use code samples#

As in the Python recipe, Dataiku provides code samples to help us get started.

  1. Open up the list of code samples and search for Get statistics (distribution).

    ../../_images/tshirt-python-customers-notebook-samples-01.png
  2. Copy the first code sample my_df["col_0"].value_counts() and paste it into a new cell in the notebook.

  3. Edit the code to be relevant to our data and then run it:

    df["campaign"].value_counts()
    

    The cell returns the number of customers who are part of the marketing campaign and the number who are not. Now we’d like to visualize the effect of campaign on the total amount a customer has spent.

  4. Since that information is in the orders_by_customer dataset, read that dataset into a new dataframe:

    dataset_orders = dataiku.Dataset("orders_by_customer")
    df_orders = dataset_orders.get_dataframe(limit=100000)
    
  5. Join it with the df dataframe:

    • Search the code samples for Join dataframes on columns.

    • Copy the code for On columns with different names to a new notebook cell, and modify it to apply to our data.

    df_joined = df.merge(df_orders, left_on="customerID", right_on="customer_id")
    
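The merge can be sanity-checked with small toy frames standing in for customers_stacked_prepared and orders_by_customer (the values below are invented):

```python
import pandas as pd

df = pd.DataFrame({"customerID": ["c1", "c2", "c3"],
                   "campaign": [True, False, True]})
df_orders = pd.DataFrame({"customer_id": ["c1", "c2"],
                          "total": [35.0, 12.5]})

# merge defaults to an inner join: customers with no matching order
# (c3 here) are dropped from the result
df_joined = df.merge(df_orders, left_on="customerID", right_on="customer_id")
print(len(df_joined))   # 2
```

Note that both key columns (customerID and customer_id) are kept in the result; passing how="left" would instead keep customers without orders, with NaN in the order columns.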

Finally, the following code produces a paneled histogram with the bar heights normalized so that it’s easier to compare across values of campaign.

pd.pivot_table(df_joined.reset_index(),
               index='index',
               columns='campaign',
               values=['total']).plot.hist(density=True, subplots=True)
../../_images/tshirt-python-customers-notebook-histogram-01.png
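The pivot that feeds the histogram can be checked on a toy frame. The sketch below uses invented values; reset_index() exposes the row number as an 'index' column, and pivoting on it produces one column of totals per value of campaign, with NaN wherever a row belongs to the other group:

```python
import pandas as pd

df_joined = pd.DataFrame({
    "campaign": [True, False, True, False],
    "total": [20.0, 10.0, 30.0, 15.0],
})

# One row per original record, one 'total' column per campaign value
pivot = pd.pivot_table(df_joined.reset_index(),
                       index="index",
                       columns="campaign",
                       values=["total"])
print(pivot.shape)   # (4, 2)
```

Calling .plot.hist(density=True, subplots=True) on this frame then draws one normalized histogram per campaign value, which is what makes the two groups comparable.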

Create a recipe from the code notebook#

Recall that the notebook is a lab environment, intended for experimentation. Therefore, the join we performed between the dataframes isn’t reflected in the current Flow as part of our data pipeline until we create a recipe.

../../_images/tshirt-python-customers-flow-01.png
  1. From within the notebook, click Create Recipe > Python recipe.

    The new recipe automatically includes the customers_stacked_prepared dataset as an input, but we still need to add a second input and name the output.

  2. Add orders_by_customer as an input dataset.

  3. Create a new output dataset called customers_enriched.

    ../../_images/tshirt-python-customers-create-recipe-01.png

    Note

    In the resulting recipe, note that Dataiku adds a number of commented lines, each of which shows the beginning of a notebook cell. This way, if we need to edit the recipe in a notebook again, our existing cells are maintained.

    Dataiku has added lines for the recipe output to the customers_enriched dataset. However, it cannot know which dataframe we want to output. Accordingly:

  4. In the last line, change pandas_dataframe to df_joined.

    ../../_images/tshirt-python-customers-recipe-01.png
  5. Run the recipe and see how the Flow is affected.

    ../../_images/tshirt-python-customers-flow-02.png

What’s next?#

Congratulations! You’ve taken your first steps with Python integration in Dataiku. As you progress, you’ll find that: