Tutorial | Recipe plugin component example#

Get started#

By writing a custom recipe, you can add a new kind of recipe to Dataiku. The idea is:

You write the core of the recipe in Python or R code.
You write a JSON descriptor that declares:
- The kinds of inputs and outputs of the recipe.
- The available configuration parameters.
In the Python or R code of the recipe, you use a specific API to retrieve the inputs, outputs and parameters (i.e., the “instantiation parameters”) of the recipe.

To the user, the custom recipe is a visual recipe in which they can enter the declared configuration parameters and run the recipe.

Objectives#

Let’s write a custom recipe that computes pairwise correlations (i.e., correlations between the values in pairs of columns). Such a recipe could be used, for example, to discover that the price of a car has a strong negative correlation with the mileage.
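
As a minimal illustration of what a pairwise correlation measures, the sketch below computes the Pearson correlation between two columns with pandas. The car data is made up for this example and isn’t part of the tutorial project.

```python
import pandas as pd

# Made-up data for illustration only: price tends to drop as mileage grows.
cars = pd.DataFrame({
    "mileage": [10000.0, 40000.0, 80000.0, 120000.0, 160000.0],
    "price": [21000.0, 17500.0, 13000.0, 9500.0, 6000.0],
})

# Series.corr computes the Pearson correlation coefficient by default.
corr = cars["price"].corr(cars["mileage"])
print(f"Correlation between price and mileage: {corr:.3f}")
```

A value close to -1 means that one column tends to decrease as the other increases, which is the kind of relationship the recipe in this tutorial reports for every pair of columns.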

We will start by writing a Python recipe in the Flow of the tutorial project, and then make it “reusable”.

Prerequisites#

In this tutorial, we’ll use the plugin we created in the introduction and add a custom recipe component to it.

Create the project#

From the Dataiku Design homepage, click + New Project.
Select Learning projects.
Search for and select Your first plugin.
If needed, change the folder into which the project will be installed, and click Install.
From the project homepage, click Go to Flow (or type g + f).

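If your version of Dataiku lists DSS tutorials rather than Learning projects, use these steps instead:

From the Dataiku Design homepage, click + New Project.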
Select DSS tutorials.
Filter by Developer.
Select Your first plugin.
From the project homepage, click Go to Flow (or type g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Create the base recipe#

Create a Python recipe with the wine_quality dataset as an input and a new wine_correlation dataset as the output.

The recipe code should look like the following:

# -*- coding: utf-8 -*-
import dataiku
import numpy as np
import pandas as pd

# Read the input
input_dataset = dataiku.Dataset("wine_quality")
df = input_dataset.get_dataframe()
column_names = df.columns

# We'll only compute correlations on numerical columns
# So extract all pairs of names of numerical columns
pairs = []
for i in range(0, len(column_names)):
    for j in range(i + 1, len(column_names)):
        col1 = column_names[i]
        col2 = column_names[j]
        if df[col1].dtype == "float64" and \
           df[col2].dtype == "float64":
            pairs.append((col1, col2))

# Compute the correlation for each pair, and write a
# row in the output array
output = []
for pair in pairs:
    # .iloc[0, 1] reads the off-diagonal cell of the 2x2 correlation matrix
    corr = df[[pair[0], pair[1]]].corr().iloc[0, 1]
    output.append({"col0" : pair[0],
                   "col1" : pair[1],
                   "corr" :  corr})

# Write the output to the output dataset
output_dataset =  dataiku.Dataset("wine_correlation")
output_dataset.write_with_schema(pd.DataFrame(output))

Run the recipe and inspect the output: a dataset with 3 columns (col0, col1, corr) and one row per pair of float64 input columns.
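
To make the expected output shape concrete, here is a self-contained sketch of the same pairwise logic on a small made-up DataFrame, with no Dataiku dependency. The helper name pairwise_correlations and the sample columns are assumptions for illustration. It uses itertools.combinations instead of nested index loops, which is a stylistic swap rather than the tutorial’s own code.

```python
from itertools import combinations

import pandas as pd


def pairwise_correlations(df):
    """Return one row per pair of float64 columns (hypothetical helper)."""
    # Keep only float64 columns, mirroring the dtype check in the recipe.
    numeric_cols = [c for c in df.columns if df[c].dtype == "float64"]
    rows = []
    for col_a, col_b in combinations(numeric_cols, 2):
        rows.append({
            "col0": col_a,
            "col1": col_b,
            "corr": df[col_a].corr(df[col_b]),
        })
    return pd.DataFrame(rows, columns=["col0", "col1", "corr"])


# Made-up sample: three float columns and one text column.
sample = pd.DataFrame({
    "acidity": [7.4, 7.8, 11.2, 7.9],
    "sugar": [1.9, 2.6, 1.9, 1.2],
    "alcohol": [9.4, 9.8, 9.8, 10.0],
    "color": ["red", "red", "white", "red"],
})
result = pairwise_correlations(sample)
print(result)
```

With n float64 columns the output has n * (n - 1) / 2 rows, so the three float columns here give three rows, and the text column is ignored.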

Convert it to a custom recipe#

To make this Python recipe a custom recipe:

Click Actions.
Choose Convert to plugin.
Select Existing dev plugin.
Choose first-plugin as the Plugin id.
Type compute-correlation as the New plugin recipe id.
Click Convert.

Dataiku generates the custom recipe files and suggests editing them in the Plugin Developer. Let’s do that now.

For the rest of the tutorial, we’ll tweak the generated files.

Note

Dataiku stores the generated files under the Plugin id. You can always find the plugins you developed by visiting the application menu and selecting Plugins > Development.

Edit definitions in recipe.json#

First, let’s have a look at the recipe.json file. The most important things to change are the inputRoles and outputRoles arrays. Roles allow you to associate one or more datasets with each kind of input and output of the recipe.

Our recipe is a simple one: it has one input role with exactly 1 dataset, and one output role with exactly 1 dataset. Edit your JSON to look like:

"inputRoles" : [
    {
        "name": "input",
        "label": "Input dataset",
        "description": "The dataset containing the raw data from which we'll compute correlations.",
        "arity": "UNARY",
        "required": true,
        "acceptsDataset": true
    }
],

"outputRoles" : [
    {
        "name": "main_output",
        "label": "Output dataset",
        "description": "The dataset containing the correlations.",
        "arity": "UNARY",
        "required": true,
        "acceptsDataset": true
    }
],
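
The role names declared here are the strings that recipe.py later passes to the role lookup functions, so they need to match exactly. As a rough sanity check, the sketch below parses an inline copy of the two roles and lists their names. The helper role_names is hypothetical, and the snippet assumes a strict JSON object with no comments, which may not hold for the generated recipe.json file itself.

```python
import json

# Inline copy of the roles, wrapped in braces to form a complete JSON object.
descriptor_text = """
{
    "inputRoles": [
        {"name": "input", "arity": "UNARY", "required": true, "acceptsDataset": true}
    ],
    "outputRoles": [
        {"name": "main_output", "arity": "UNARY", "required": true, "acceptsDataset": true}
    ]
}
"""

descriptor = json.loads(descriptor_text)


def role_names(descriptor, key):
    """Collect role names and reject duplicates (hypothetical helper)."""
    names = [role["name"] for role in descriptor.get(key, [])]
    if len(names) != len(set(names)):
        raise ValueError(f"Duplicate role name in {key}: {names}")
    return names


input_roles = role_names(descriptor, "inputRoles")
output_roles = role_names(descriptor, "outputRoles")
print(input_roles, output_roles)
```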

We’d like to allow users of this plugin to be able to focus on “strong” correlations (i.e., values that are closest to +1 or -1).

We can specify a threshold parameter that can be set in the recipe dialog by editing the params section of recipe.json:

"params": [
    {
        "name": "threshold",
        "label" : "Threshold for showing a correlation",
        "type": "DOUBLE",
        "defaultValue" : 0.5,
        "description":"Correlations whose absolute value does not exceed the threshold will not appear in the output dataset",
        "mandatory" : true
    }
],

Edit code in recipe.py#

Now let’s edit recipe.py. The default contents include some generic starter code for referencing roles and parameters, the code from your Python recipe, and some comments that explain how to finish creating your custom recipe. In the end, your recipe.py should start with code for retrieving datasets and parameters like:

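import dataiku
# Helpers for reading the roles and parameters of a custom recipe
from dataiku.customrecipe import get_input_names_for_role
from dataiku.customrecipe import get_output_names_for_role
from dataiku.customrecipe import get_recipe_config

# Retrieve array of dataset names from 'input' role, then create datasets
input_names = get_input_names_for_role('input')
input_datasets = [dataiku.Dataset(name) for name in input_names]

# For outputs, the process is the same:
output_names = get_output_names_for_role('main_output')
output_datasets = [dataiku.Dataset(name) for name in output_names]

# Retrieve parameter values from the map of parameters
threshold = get_recipe_config()['threshold']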
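
Because the threshold comes from user input in the recipe dialog, it can help to validate it before use. The sketch below is one possible approach and is not part of the generated code: parse_threshold is a hypothetical helper that takes a plain dict standing in for the result of get_recipe_config(), and the range check is an assumption based on absolute correlations never exceeding 1.

```python
def parse_threshold(config, default=0.5):
    """Return a validated threshold from a config dict (hypothetical helper)."""
    raw = config.get("threshold", default)
    try:
        threshold = float(raw)
    except (TypeError, ValueError):
        raise ValueError(f"threshold must be a number, got {raw!r}")
    # An absolute correlation lies in [0, 1], so values outside that range
    # would keep everything or nothing.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be between 0 and 1, got {threshold}")
    return threshold


# A plain dict stands in for get_recipe_config() in this sketch.
print(parse_threshold({"threshold": 0.8}))
print(parse_threshold({}))
```

In recipe.py, the same helper could be called as parse_threshold(get_recipe_config()) so that an invalid value fails early with a readable message.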

The portion of your original recipe that reads inputs needs to be updated to refer to the datasets created from the input roles, like:

# Read the input
input_dataset = input_datasets[0]
df = input_dataset.get_dataframe()
column_names = df.columns

The portion of your original recipe that computes the correlations should be updated to include the threshold to filter out the weak correlations:

for pair in pairs:
    corr = df[[pair[0], pair[1]]].corr().iloc[0, 1]
    # Keep only pairs whose absolute correlation exceeds the threshold
    if np.abs(corr) > threshold:
        output.append({"col0" : pair[0],
                       "col1" : pair[1],
                       "corr" :  corr})

The portion of your original recipe that writes the output datasets also needs to be updated to refer to the datasets created from the output roles, like:

# Write the output to the output dataset
output_dataset =  output_datasets[0]
output_dataset.write_with_schema(pd.DataFrame(output))
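
With a high threshold, output can end up empty, and a DataFrame built from an empty list has no columns at all. One way to keep the three output columns in that case is to pass them explicitly, as sketched below. How write_with_schema handles a DataFrame without columns isn’t covered in this tutorial, so treat this as a precaution rather than a required fix. The constant name OUTPUT_COLUMNS is an assumption for illustration.

```python
import pandas as pd

OUTPUT_COLUMNS = ["col0", "col1", "corr"]

# What the loop produces when no pair passes the threshold.
output = []

without_columns = pd.DataFrame(output)
with_columns = pd.DataFrame(output, columns=OUTPUT_COLUMNS)

print(list(without_columns.columns))  # no columns
print(list(with_columns.columns))     # the three declared columns
```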

Verify that wine_quality and wine_correlation no longer appear in your recipe. In general, the rest of recipe.py can be left as-is.
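
Two details of the threshold filter in the correlation loop are easy to miss: the comparison is strict, and a NaN correlation (which pandas returns when one of the columns is constant) never passes it. The sketch below demonstrates both on hand-written values. The helper keep_strong is hypothetical, and it uses the built-in abs, which behaves like np.abs for plain floats.

```python
def keep_strong(rows, threshold):
    """Keep rows whose absolute correlation strictly exceeds the threshold."""
    # Any comparison with NaN is False, so NaN correlations are dropped.
    return [row for row in rows if abs(row["corr"]) > threshold]


# Hand-written values for illustration only.
rows = [
    {"col0": "a", "col1": "b", "corr": 0.9},
    {"col0": "a", "col1": "c", "corr": -0.7},
    {"col0": "b", "col1": "c", "corr": 0.5},           # equal to the threshold
    {"col0": "a", "col1": "d", "corr": float("nan")},  # e.g. a constant column
]
strong = keep_strong(rows, threshold=0.5)
print([(row["col0"], row["col1"]) for row in strong])
```

Only the first two pairs remain: the pair with a correlation of exactly 0.5 is excluded by the strict comparison, and the NaN pair is excluded because the comparison evaluates to False.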

Use your custom recipe in the Flow#

Note

After editing recipe.json for a custom recipe, you must do the following:

Click Reload.
Reload the Dataiku page in your browser.

When modifying the recipe.py file, you don’t need to reload anything. Simply run the recipe again.

Go to the Flow.
Click + Recipe and select your plugin recipe. The usual recipe creation tab appears.
Select the wine_quality input dataset.
Create a new output dataset.
Run the recipe, editing the default threshold value if you desire.

Congratulations, you have created your first custom visual recipe!