Tutorial | R basics with Dataiku

In this tutorial, we will show you how to:

  • Integrate R as part of your data pipeline through code recipes.

  • Use Jupyter notebooks to prototype and test code.

  • Transfer a Dataiku dataset into an R dataframe and back, using the dataiku R package.

We will work with the fictional retailer Haiku T-Shirt’s data.


This tutorial assumes that you are familiar with the Basics courses.

Technical requirements

Access to a Dataiku instance that has the R integration installed.

Create your project

The first step is to create a new Dataiku Project.

  1. From the Dataiku homepage, click +New Project > DSS tutorials > Developer > R in Dataiku.


    You can also download the starter project from this website and import it as a zip file.

  2. Go to the Flow.


In the Flow, you see the Haiku T-Shirt orders and customers_stacked data uploaded into Dataiku. Further, the customers_stacked data has been prepared with a visual Prepare recipe.

Your first R recipe

Our current goal is to group past orders by customer, aggregating their past interactions. In the Basics courses, we accomplished this with a visual Group recipe, but it can also be easily accomplished with R code.

  1. With the orders dataset selected, choose Actions > Code Recipes > R.

  2. Add a new output dataset named orders_by_customer.

  3. Click Create Recipe.

The recipe is now populated with the following code, which reads the orders dataset into an R dataframe named orders, passes it unchanged to a new dataframe named orders_by_customer, and writes that new dataframe out to the orders_by_customer dataset.


library(dataiku)

# Recipe inputs
orders <- dkuReadDataset("orders", samplingMethod="head", nbRows=100000)

# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a R dataframe or data table
orders_by_customer <- orders # For this sample code, simply copy input to output

# Recipe outputs
dkuWriteDataset(orders_by_customer, "orders_by_customer")
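
Note that the generated read call uses samplingMethod="head" with nbRows=100000, so only the first 100,000 rows are read. The base R sketch below shows what that can mean for aggregates. It uses a made-up data frame, with head() standing in for the sampled read; the names orders_full, orders_head, and amount are illustrative and not part of the tutorial data.

```r
# Hypothetical toy data: four orders from two customers
orders_full <- data.frame(
  customer_id = c("c1", "c2", "c1", "c2"),
  amount      = c(10, 20, 30, 40)
)

# Stand-in for a read that keeps only the first rows of a dataset
orders_head <- head(orders_full, 2)

# Per-customer totals on all rows versus on the first rows only
total_full <- tapply(orders_full$amount, orders_full$customer_id, sum)
total_head <- tapply(orders_head$amount, orders_head$customer_id, sum)

print(total_full)
print(total_head)
```

If a dataset has more rows than the sample size, aggregates computed from the sampled dataframe reflect only the rows that were read.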

As the commented TODO says, we’ll need to provide the code that aggregates the orders by customer. Dataiku provides a number of code samples to help get us started.

  1. Search for Group By in the code samples.

  2. Click +Insert on the Group on one column sample to replace the line where orders_by_customer is defined.

  3. Edit the code to apply to our data:

orders %>%
  group_by(customer_id) %>%
  summarize(mean(pages_visited), sum(tshirt_quantity*tshirt_price)) ->
  orders_by_customer

This creates a dataframe named orders_by_customer with one row per customer_id. For each customer, we've computed the average number of pages visited on the Haiku T-Shirt website during orders, and the total value of the customer's orders, where the value of each order is the t-shirt price multiplied by the number of t-shirts purchased.
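
To see the aggregation at work outside of Dataiku, here is a minimal sketch on a made-up data frame. The column names match the tutorial's orders dataset, but the rows and the names orders_toy and toy_by_customer are invented for illustration.

```r
library(dplyr)

# Hypothetical toy rows standing in for the orders dataset
orders_toy <- data.frame(
  customer_id     = c("c1", "c1", "c2"),
  pages_visited   = c(2, 4, 10),
  tshirt_quantity = c(1, 2, 3),
  tshirt_price    = c(10, 20, 5)
)

# Same pipeline as in the recipe, applied to the toy rows
orders_toy %>%
  group_by(customer_id) %>%
  summarize(mean(pages_visited), sum(tshirt_quantity * tshirt_price)) ->
  toy_by_customer

# One row per customer; unnamed summaries are named after their expressions
print(toy_by_customer)
```

Customer c1 has two orders worth 1 * 10 and 2 * 20, so its total is 50 and its mean pages visited is 3.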

An important thing to note about this code is that it uses functions from the dplyr package.

  1. Add a library(dplyr) statement at the top of the recipe.

  2. Run the recipe and explore the output dataset.


    The names for the computed columns are descriptive, but sum(tshirt_quantity * tshirt_price) could be simplified to total. Let’s fix this.

  3. Click Parent Recipe in the orders_by_customer dataset to reopen the recipe.

  4. Click Edit in Notebook to open a Jupyter notebook where we can interactively test the recipe code.

    The recipe code begins in a single cell.

  5. Split the cell so that the code to write recipe outputs is in a separate cell.

  6. Add a cell between the two existing cells, and put code in it that previews the dataframe, for example head(orders_by_customer).

  7. To change the name of the computed column, add total= to the code that defines the dataframe so that it looks like the following.

    orders %>%
      group_by(customer_id) %>%
      summarize(mean(pages_visited), total=sum(tshirt_quantity*tshirt_price)) ->
      orders_by_customer

  8. Run the first two cells in the notebook to verify the new column name.

  9. Click Save Back to Recipe and run the recipe again.

Now the output dataset contains a total column.
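
The same name = expression pattern works for any summary. As an optional illustration, the sketch below names both computed columns on a made-up data frame. The name avg_pages and the toy rows are assumptions, not part of the tutorial project.

```r
library(dplyr)

# Hypothetical toy rows standing in for the orders dataset
orders_toy <- data.frame(
  customer_id     = c("c1", "c1", "c2"),
  pages_visited   = c(2, 4, 10),
  tshirt_quantity = c(1, 2, 3),
  tshirt_price    = c(10, 20, 5)
)

# Give each summary an explicit column name
orders_toy %>%
  group_by(customer_id) %>%
  summarize(
    avg_pages = mean(pages_visited),
    total     = sum(tshirt_quantity * tshirt_price)
  ) ->
  toy_named

print(names(toy_named))
```

Explicit names are easier to refer to in later recipes than names derived from the expression text.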


Explore with an R notebook

Previously, we started with an R recipe because we had a specific goal of transforming the orders dataset. If we don’t have a dataset transformation goal in mind, we can explore the data using a notebook.

Create an R notebook

  1. Select the customers_stacked_prepared dataset.

  2. Click Lab > New Code Notebook > R.

  3. Select the Read the dataset in an R dataframe option, then click Create.


    The notebook is automatically populated with two cells.


    The first cell imports the dataiku package.


    The second cell reads the customers_stacked_prepared dataset into a dataframe named df.

    # Read the dataset as a R dataframe in memory
    # Note: here, we only read the first 100K rows. Other sampling options are available
    df <- dkuReadDataset("customers_stacked_prepared", samplingMethod="head", nbRows=100000)

  4. Run each of the cells in order.


    The notebook now has the dataframe df ready in memory.

  5. Copy and then run the following code in a new cell in the notebook.

    library(dplyr)  # count() comes from dplyr
    count(df, campaign)

    It returns the number of customers who are part of the marketing campaign and the number who aren’t.
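
For reference, here is what count() does on a small made-up data frame. We assume here that campaign is a logical column; the rows and the names customers_toy and campaign_counts are invented for illustration.

```r
library(dplyr)

# Hypothetical toy rows standing in for customers_stacked_prepared
customers_toy <- data.frame(
  customerID = c("c1", "c2", "c3", "c4"),
  campaign   = c(TRUE, FALSE, TRUE, TRUE)
)

# One row per distinct value of campaign, with the row count in column n
campaign_counts <- count(customers_toy, campaign)
print(campaign_counts)
```

The result has one row for each value of campaign and a column n holding the number of customers with that value.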

Now we’d like to visualize the effect of campaign on the total amount a customer has spent. Since that information is in the orders_by_customer dataset, we’ll need to read that dataset into a new dataframe.

  1. In a new cell, copy the following code.

    df_orders <- dkuReadDataset("orders_by_customer")

  2. Run it.

Now join it with the df dataframe. As in the R recipe, Dataiku provides helpful code samples.

  1. Add a new cell.

  2. Search the code samples for Join data.frames.

  3. Copy the code for Conduct a left-join between two data.frames to the notebook cell.

  4. Modify it to apply to our data, so that it matches the code shown after these steps.

  5. Run the cell.

df %>%
  left_join(df_orders, by = c("customerID" = "customer_id")) ->
  customers_enriched
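
The by argument maps the key column of the left dataframe to the differently named key column of the right one. A minimal sketch on made-up data shows the behavior, including what happens to a customer with no orders. The rows and the names customers_toy, orders_toy, and toy_enriched are invented for illustration.

```r
library(dplyr)

# Hypothetical toy rows: three customers, but orders for only two of them
customers_toy <- data.frame(
  customerID = c("c1", "c2", "c3"),
  campaign   = c(TRUE, FALSE, TRUE)
)
orders_toy <- data.frame(
  customer_id = c("c1", "c2"),
  total       = c(50, 15)
)

# Keep every customer; match customerID (left) to customer_id (right)
customers_toy %>%
  left_join(orders_toy, by = c("customerID" = "customer_id")) ->
  toy_enriched

print(toy_enriched)
```

A left join keeps every row of the left dataframe, so customer c3 appears in the result with a missing (NA) total. The key column keeps its left-hand name, customerID.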

Finally, the following code produces a paneled histogram with the bar heights normalized so that it’s easier to compare across values of campaign.

  1. Add a new cell.

  2. Copy the contents below and run the cell.

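    library(ggplot2)
    # Plot density rather than raw counts so the panels are comparable
    ggplot(customers_enriched, aes(total)) +
      geom_histogram(aes(y = after_stat(density))) +
      facet_grid(. ~ campaign)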

Create a recipe from the code notebook

Recall that the notebook is a lab environment, and so the join we performed between the dataframes isn’t reflected in the Flow until we create a recipe.

  1. From within the notebook, click Create Recipe > R recipe.

  2. Add orders_by_customer as an input.

  3. Create a new output dataset called customers_enriched.

  4. Run the resulting recipe and see how the Flow is affected.


What’s next?

Congratulations! You’ve taken the first steps on R integration in Dataiku. As you progress, you’ll find that the use of R in Dataiku is extensible. You can create: