Basics of R in Dataiku DSS
In this tutorial, we will show you how to:
Integrate R as part of your data pipeline through code recipes
Use Jupyter notebooks to prototype and test code
Transfer a Dataiku dataset into an R dataframe and back
We will work with the fictional retailer Haiku T-Shirt’s data.
This tutorial assumes that you are familiar with the Basics courses.
Create Your Project
The first step is to create a new Dataiku DSS Project. From the Dataiku homepage, click +New Project > DSS Tutorials > Code > Tutorial: R in Dataiku DSS. Click on Go to Flow.
In the Flow, you see the Haiku T-Shirt orders and customer data uploaded into Dataiku DSS. Further, the customer data has been prepared with a visual Prepare recipe.
Your First R Recipe
Our current goal is to group past orders by customer, aggregating their past interactions. In the Basics courses, we accomplished this with a Group visual recipe, but it can also be easily accomplished with R code.
With the orders dataset selected, choose Actions > Code Recipes > R. Add a new output dataset named orders_by_customer. Click Create Recipe.
The recipe form is now populated with the following code, which reads the orders dataset into an R dataframe named orders, passes it unchanged to a new dataframe named orders_by_customer, and writes that new dataframe out to the orders_by_customer dataset.
library(dataiku)

# Recipe inputs
orders <- dkuReadDataset("orders", samplingMethod="head", nbRows=100000)

# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a R dataframe or data table
orders_by_customer <- orders # For this sample code, simply copy input to output

# Recipe outputs
dkuWriteDataset(orders_by_customer,"orders_by_customer")
As the commented TODO says, we’ll need to provide the code that aggregates the orders by customer. Dataiku provides a number of code samples to help get us started. Search for “group by” in the code samples.
Click +Insert on the “Group on one column” sample to replace the line where orders_by_customer is defined, and then edit the code to apply to our data:
orders %>%
  group_by(customer_id) %>%
  summarize(mean(pages_visited), sum(tshirt_quantity*tshirt_price)) -> orders_by_customer
This creates a dataframe named orders_by_customer with one row per customer_id. For each customer, we’ve computed the average number of pages visited on the Haiku T-Shirt website during orders, and the total value of the customer’s orders, where the value of each order is the t-shirt price multiplied by the number of t-shirts purchased.
An important thing to note about this code is that it uses functions from the dplyr package, so we need to add a library(dplyr) statement at the top of the recipe for it to run successfully.
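To see what this aggregation computes, here is a minimal sketch that runs outside of DSS on a small made-up data frame. The values are assumptions for illustration only; the column names match the ones used in the recipe code.

```r
library(dplyr)

# Hypothetical stand-in for the orders dataset (values are made up)
orders <- data.frame(
  customer_id     = c("c1", "c1", "c2"),
  pages_visited   = c(4, 6, 3),
  tshirt_quantity = c(2, 1, 3),
  tshirt_price    = c(10, 20, 15)
)

# Same aggregation as in the recipe: one output row per customer
orders %>%
  group_by(customer_id) %>%
  summarize(mean(pages_visited), sum(tshirt_quantity * tshirt_price)) -> orders_by_customer

print(orders_by_customer)
```

For customer c1, the average of 4 and 6 pages is 5, and the order value is 2 × 10 + 1 × 20 = 40.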
Now run the recipe, and when it completes, explore the output dataset. The names of the computed columns are descriptive, but sum(tshirt_quantity * tshirt_price) could be given a simpler name.
Let’s fix this. Click Parent Recipe in the orders_by_customer dataset to quickly reopen the recipe and then click Edit in Notebook. This opens a Jupyter notebook with the recipe code, where we can interactively test the code.
The recipe code begins in a single cell. Split the cell so that the code to write recipe outputs is in a separate cell. Next, add a cell between the two existing cells and put the following code in it.
In order to change the name of the computed column, add total= to the code that defines the dataframe so that it looks like the following.
orders %>%
  group_by(customer_id) %>%
  summarize(mean(pages_visited), total=sum(tshirt_quantity*tshirt_price)) -> orders_by_customer
Run the first two cells in the notebook to verify the new column name, then click Save Back to Recipe and run the recipe again. Now the output dataset contains a total column.
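The effect of naming a summary column can also be checked on a small made-up data frame outside of DSS. In this sketch the values are assumptions, and avg_pages is a hypothetical name, added only to show that the other computed column could be renamed the same way.

```r
library(dplyr)

# Hypothetical stand-in for the orders dataset (values are made up)
orders <- data.frame(
  customer_id     = c("c1", "c1", "c2"),
  pages_visited   = c(4, 6, 3),
  tshirt_quantity = c(2, 1, 3),
  tshirt_price    = c(10, 20, 15)
)

# An unnamed summary takes the text of its expression as the column name
partly_named <- orders %>%
  group_by(customer_id) %>%
  summarize(mean(pages_visited), total = sum(tshirt_quantity * tshirt_price))
print(names(partly_named))

# Naming every summary gives columns that are easier to reference later
named <- orders %>%
  group_by(customer_id) %>%
  summarize(avg_pages = mean(pages_visited),
            total     = sum(tshirt_quantity * tshirt_price))
print(names(named))
```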
Explore with an R Notebook
Previously, we started with an R recipe because we had a specific goal of transforming the orders dataset. If we don’t have a dataset transformation goal in mind, we can explore the data using a notebook.
Select the customers_stacked_prepared dataset and click Lab > New > R notebook. We’ll read the dataset into an R dataframe; click Create.
The notebook is automatically populated with two cells.
The first cell imports the dataiku package.
The second cell reads the customers_stacked_prepared dataset into a dataframe:
# Read the dataset as a R dataframe in memory
# Note: here, we only read the first 100K rows. Other sampling options are available
df <- dkuReadDataset("customers_stacked_prepared", samplingMethod="head", nbRows=100000)
Run each of the cells in order. The notebook now has the dataframe df ready in memory.
For now, we’ll write the following code in a new cell in the notebook.
library(dplyr)
count(df, campaign)
Run the cell; it returns the number of customers who are part of the marketing campaign and the number who aren’t. Now we’d like to visualize the effect of campaign on the total amount a customer has spent. Since that information is in the orders_by_customer dataset, we’ll need to read that dataset into a new dataframe:
df_orders = dkuReadDataset("orders_by_customer")
… and join it with the df dataframe. As in the R recipe, Dataiku provides helpful code samples. Search the code samples for “join data.frames”, copy the code for “Conduct a left-join between two data.frames” to the notebook cell, and modify it to apply to our data.
customers_enriched = left_join(df, df_orders, by=c("customerID" = "customer_id"))
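A left join keeps every row of the first dataframe and fills in NA where the second has no match. The sketch below illustrates this with two small made-up data frames that can be run outside of DSS; the values are assumptions, while the key names customerID and customer_id follow the join above.

```r
library(dplyr)

# Hypothetical stand-ins for the two dataframes (values are made up)
df <- data.frame(
  customerID = c("c1", "c2", "c3"),
  campaign   = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
df_orders <- data.frame(
  customer_id = c("c1", "c2"),
  total       = c(40, 45),
  stringsAsFactors = FALSE
)

# In `by`, the name on the left of "=" is the key in df,
# and the name on the right is the key in df_orders
customers_enriched <- left_join(df, df_orders, by = c("customerID" = "customer_id"))
print(customers_enriched)
```

Customer c3 has no matching row in df_orders, so its total is NA in the joined result.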
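Finally, the following code produces a paneled histogram of total, with one panel per value of campaign. Mapping the bar heights to density normalizes them so that it’s easier to compare across values of campaign.
library(ggplot2)

# after_stat(density) requires ggplot2 3.3.0 or later
ggplot(customers_enriched, aes(total)) +
  geom_histogram(aes(y = after_stat(density))) +
  facet_grid(. ~ campaign)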
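The normalization can be checked on made-up data outside of DSS. The sketch below is one possible approach, assuming ggplot2 3.3.0 or later for after_stat(); the simulated values and the choice of 10 bins are assumptions for illustration only.

```r
library(ggplot2)

set.seed(1)  # made-up data, for illustration only
customers_enriched <- data.frame(
  campaign = rep(c(TRUE, FALSE), times = c(20, 80)),
  total    = c(rnorm(20, mean = 120, sd = 30), rnorm(80, mean = 80, sd = 30))
)

# Map bar height to density so that the bars in each panel cover a total
# area of 1, regardless of how many customers fall in that panel
p <- ggplot(customers_enriched, aes(total)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10) +
  facet_grid(. ~ campaign)
print(p)
```

With raw counts, the panel with 80 customers would dwarf the panel with 20; with densities, the two shapes can be compared directly.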
Recall that the notebook is a lab environment, so the join we performed between the dataframes isn’t reflected in the Flow until we create a recipe.
From within the notebook, click Create Recipe > R recipe. It has automatically included the customers_stacked_prepared dataset as an input, but now we’ll want to add orders_by_customer as an input and create a new output dataset called customers_enriched.
Run the resulting recipe and see how the Flow is affected.
Congratulations! You’ve taken the first steps with R integration in Dataiku. As you progress, you’ll find that the use of R in Dataiku is extensible. You can create:
Code environments to manage package dependencies and versions for your projects
Custom R libraries to reuse code across recipes and notebooks; these are meant to fit into the Git-based development workflow