Tutorial | Dataiku for R users#

Get started#

Dataiku’s visual interface enables collaboration with a wide pool of colleagues who may not be coders (or R coders for that matter). At the same time, code integrations for languages like Python and R retain the flexibility needed when greater customization or freedom is desired.

Objectives#

In this tutorial, you will:

  • Build an ML pipeline from R notebooks, recipes, and project variables.

  • Apply a custom R code environment to use CRAN packages not found in the built-in environment.

  • Use the Dataiku R API to edit Dataiku recipes and create ggplot2 insights from within RStudio.

  • Import R code from a Git repository into a Dataiku project library.

  • Work with managed folders to handle file types such as *.RData.

Prerequisites#

Create the project#

  1. From the Dataiku Design homepage, click + New Project.

  2. Select Learning projects.

  3. Search for and select Dataiku for R Users.

  4. Click Install.

  5. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

You’ll next want to build the Flow.

  1. Click Flow Actions at the bottom right of the Flow.

  2. Click Build all.

  3. Keep the default settings and click Build.

Observe the visual Flow#

We’ll start with a project built entirely with visual tools:

  • The data pipeline consists of visual recipes.

  • The native chart builder has been used for exploratory visualizations.

  • The machine learning model (shown in green) has been built with the visual ML interface.

Although using visual tools can amplify collaboration and understanding with non-R users, recreating this Flow in R opens opportunities for greater customization at every stage.

When you have completed the tutorial, you will have built the bottom half of the Flow pictured below:

Dataiku screenshot of a Flow with parallel R and visual workflows.

Even if you are primarily an R user, it helps to familiarize yourself with the available set of visual recipes and what they can achieve.

The table below is far from a one-to-one mapping, but for some of the most common data preparation functions in base R or the tidyverse, it suggests a Dataiku recipe or processor that performs a similar operation.

| R package | R function | Similar Dataiku recipe/processor |
|---|---|---|
| dplyr | mutate() | Formulas |
| dplyr | select() | Delete/Keep processor |
| dplyr | filter() | |
| dplyr | arrange() | Sort recipe |
| dplyr | group_by() %>% summarize() | Group recipe |
| dplyr | group_by() %>% mutate() | Window recipe |
| dplyr | *_join() | Join recipe |
| dplyr | distinct() | Distinct recipe |
| tidyr | gather() / pivot_longer() | Fold multiple columns processor |
| tidyr | spread() / pivot_wider() | Pivot recipe |
| base, dplyr | rbind(), bind_rows() | Stack recipe |
| base, dplyr | subset(), group_split() | Split recipe |
| base, dplyr | head/tail(), slice_min/max() | Top N recipe |
| stringr | NA | Transform string processor et al. |
| lubridate | NA | Date processors |
| fuzzyjoin | NA | Fuzzy Join recipe |

Note

As shown in the table, processors found in the Prepare recipe handle many data preparation functions. Moreover, many recipes and processors — although having a visual interface on top — are SQL-compatible.
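To make the difference between the Group and Window rows of the table concrete, here is a minimal sketch on a hypothetical toy data frame (the values are invented, and the column names are only borrowed from the tutorial dataset for illustration). Summarizing collapses each group to one row, while a grouped mutate keeps every row and attaches the group-level value to it.

```r
library(dplyr)

# Hypothetical toy data: a few calls per state
calls <- data.frame(
  State = c("KS", "KS", "OH", "OH", "OH"),
  Day_Mins = c(100, 200, 50, 150, 100)
)

# Similar in spirit to a Group recipe: one output row per group
per_state <- calls %>%
  group_by(State) %>%
  summarize(mean_mins = mean(Day_Mins))

# Similar in spirit to a Window recipe: same number of rows as the input,
# with the group mean repeated on each row
with_state_mean <- calls %>%
  group_by(State) %>%
  mutate(mean_mins = mean(Day_Mins)) %>%
  ungroup()

print(per_state)
print(with_state_mean)
```

The first result has two rows (one per state), and the second keeps all five input rows.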

Code in an R notebook#

Let’s start writing R code!

Create an R notebook from a dataset#

The fastest way to start writing R code is in a Jupyter notebook.

  1. From the Flow, select the churn_copy dataset.

  2. In the right side panel, navigate to the Lab tab.

  3. Under Code Notebooks, select New.

  4. In the dialog, select R.

  5. Click Create.

Dataiku screenshot of the dialog to create an R notebook.

Inspect the starter code#

It’s worth taking a moment to understand the starter code in the default notebook.

  • The cell library(dataiku) loads the dataiku R package, which includes functions for interacting with Dataiku objects, such as datasets and folders.

  • The second cell creates an in-memory R dataframe named df from the Dataiku dataset from which this notebook was created.

Important

The churn_copy dataset, in this case, is a managed filesystem dataset, resulting from the original uploaded CSV file. However, if the Sync recipe were instead moving the CSV file to an SQL database, an HDFS cluster, or cloud storage, the syntax in the R notebook would be exactly the same.

  1. Run the first two cells in the notebook.

  2. Add a new cell with code like head(df) to start exploring the dataframe as you normally would.

Dataiku screenshot of an R notebook.

Note

dkuReadDataset() is not the only way to read a dataset with R. dkuSQLQueryToData() makes it possible to execute SQL queries from R. This can be helpful when you want to pull in only the records returned by a specific query, rather than relying on one of the standard sampling options.

Use dplyr in a notebook#

Notice that the kernel of this notebook in the upper right corner says that it uses the built-in R environment.

Let’s find out what’s included in this environment.

  1. Add a new cell to the notebook.

  2. Run the command library()$results[,1] to see a list of installed packages in the current environment.

You’ll find that many familiar packages, including those found in the tidyverse, are already installed in the built-in environment. Let’s use dplyr in this notebook as an example.

  1. Add library(dplyr) in a cell at the top of the notebook.

  2. Delete any exploratory code.

  3. After the assignment of df, add the following code to create a new dataframe df_prepared_r.

    df %>%
        rename(Churn = Churn.) %>%
        mutate(Churn = if_else(Churn == "True.", "True", "False"),
            Area_Code = as.character(Area_Code)) %>%
        select(-Phone) ->
        df_prepared_r
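To see what this pipeline does without opening the full dataset, the sketch below runs the same steps on a hypothetical three-row data frame. The values are invented for illustration, and only the columns touched by the pipeline are mimicked.

```r
library(dplyr)

# Hypothetical stand-in for the churn_copy dataframe
toy <- data.frame(
  State = c("KS", "OH", "NJ"),
  Area_Code = c(415L, 408L, 510L),
  Phone = c("382-4657", "371-7191", "358-1921"),
  Churn. = c("False.", "True.", "False."),
  stringsAsFactors = FALSE
)

toy_prepared <- toy %>%
  rename(Churn = Churn.) %>%                           # drop the trailing dot from the name
  mutate(Churn = if_else(Churn == "True.", "True", "False"),  # drop it from the values
         Area_Code = as.character(Area_Code)) %>%      # treat area codes as labels
  select(-Phone)                                       # remove the identifier column

str(toy_prepared)
```

The result keeps State, Area_Code (now character), and a cleaned Churn column.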

Set a new code environment#

In many cases, we’ll want to use R packages not found in the built-in environment. In these cases, we (or an admin) will need to create a code environment.

For this tutorial, we’ll need the following R packages:

    tidyr
    ggplot2
    gbm
    caret

  1. If you don’t already have such a code environment, see How-to | Create a code environment.

  2. Select this code environment at the project level following How-to | Set a code environment.
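Once a code environment is selected, one way to check that it contains the packages listed above is to compare them against the installed packages from a notebook cell. This is a minimal sketch in base R; the variable names are only illustrative.

```r
# Packages this tutorial relies on
required <- c("tidyr", "ggplot2", "gbm", "caret")

# Names of all packages installed in the library paths of the current session
installed <- rownames(installed.packages())

# Any required package that is not installed
missing_pkgs <- setdiff(required, installed)

if (length(missing_pkgs) == 0) {
  message("All required packages are available.")
} else {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```

If the message lists missing packages, the notebook or recipe is probably still running with a different code environment.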

Iterate between an R notebook and a recipe#

The df_prepared_r dataframe only exists in the Lab, an experimental space for prototyping. This work does not exist yet in the Flow, where production outputs are actually built. For the latter, we’ll need a code recipe.

Create an R recipe from a notebook#

We can convert the existing code notebook into a code recipe. To do so, we’ll need to define an output to the recipe.

  1. From the notebook, click + Create Recipe.

  2. Click OK to accept the standard R recipe type.

  3. Under Outputs, click + Add.

  4. Name the output dataset churn_prepared_r.

  5. Click Create Dataset.

  6. Click Create Recipe.

Dataiku screenshot of the dialog to create a code recipe from a notebook.

Run an R recipe#

The recipe contains the same code found in the notebook, but with an additional line to write the output. We’ll need to change this line to write the new dataframe we’ve created.

  1. Change the last line to dkuWriteDataset(df_prepared_r,"churn_prepared_r").

  2. Click Run (or type @ + r + u + n) to execute the recipe, and then explore the output dataset in the Flow.

Use the R API outside Dataiku#

You have seen how the Dataiku R API works in notebooks and recipes within Dataiku. You can also use it outside Dataiku, for example in an IDE such as RStudio, by following the reference documentation on Using the R API outside of DSS.

Important

The instructions for downloading the dataiku package and setting up the connection with Dataiku are covered in the reference documentation. If you’d rather not set this up at this time, feel free to create a new R notebook within Dataiku for this section.

After configuring a connection, you can use the Dataiku R API to read datasets found in Dataiku projects and code freely, even sharing visualizations, for example, back to the Dataiku instance.

  1. In a new R script of your IDE (or a new R notebook if staying within Dataiku), copy/paste and run the code below to save a ggplot2 object as a static insight.

    Note

    If working outside Dataiku, you’ll need to supply an API key. One way to find this is by going to Profile & Settings > API keys. Also, be sure to check that your project key is the same as given below.

    library(dataiku)
    library(dplyr)
    library(tidyr)
    library(ggplot2)
    
    # These lines are unnecessary if running within Dataiku
    dkuSetRemoteDSS("http(s)://DSS_HOST:DSS_PORT/", "Your API Key")
    dkuSetCurrentProjectKey("DKU_TUT_R_USERS") # Replace with your project key if different
    
    # Read the dataset as an in-memory R dataframe
    df <- dkuReadDataset("churn_prepared_r", samplingMethod="head", nbRows=100000)
    
    # Create the plot
    df %>%
      select(-c(State, Area_Code, Intl_Plan, VMail_Plan)) %>%
      gather("metric", "value", -Churn) %>%
      ggplot(aes(x = value, color = Churn)) +
      facet_wrap(~ metric, scales = "free") +
      geom_density()
    
    # Save plot as a static insight
    dkuSaveGgplotInsight("density-plots-by-churn")
    
  2. After running the code above, return to Dataiku, and navigate to the Insights page (g + i) to confirm the insight has been added.

  3. If you wish, you can publish it to a dashboard like any other insight, such as native charts or model reports.

Tip

In addition to ggplot2, the Dataiku R API has similar convenience functions for creating static insights with dygraphs, ggvis, and googleVis. You can gain more practice creating static insights with ggplot2 in Tutorial | Static insights.

Note

The code above visualizes the distribution for all numeric variables in the dataset among churning and returning customers. While the distribution for many variables is quite similar, a few variables like CustServ_Calls, Day_Charge, and Day_Mins follow different patterns.

Grid of density plots for each numeric variable by churn status
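The plotting code depends on first reshaping the data to a long format, with one row per customer and metric, so that facet_wrap() can draw one panel per metric. The sketch below shows that reshaping step on a hypothetical two-row data frame (values invented). It also shows the equivalent pivot_longer() call, since tidyr documents gather() as superseded.

```r
library(tidyr)

# Hypothetical wide data: one row per customer
toy <- data.frame(
  Churn = c("True", "False"),
  Day_Mins = c(300.5, 180.2),
  CustServ_Calls = c(4, 1)
)

# Reshape as in the tutorial code
long_gather <- gather(toy, "metric", "value", -Churn)

# Equivalent reshape with the newer function (row order differs)
long_pivot <- pivot_longer(toy, cols = -Churn,
                           names_to = "metric", values_to = "value")

print(long_gather)
print(long_pivot)
```

Both results contain the columns Churn, metric, and value, with one row for each combination of customer and metric.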

Edit recipes from RStudio#

Returning to the Flow, you can see that the Split recipe divides the prepared data into a training set (70%) and a test set (30%).

Let’s achieve the same outcome with another R recipe, but demonstrate using the RStudio Desktop integration for editing and saving existing recipes.

Note

In addition to the RStudio integration used here, some users may also prefer to write R code in the RStudio Server IDE through a Code Studio template.

Create an R recipe from the Flow#

In addition to converting R notebooks to R recipes, we can also directly create a new R recipe.

  1. Select the churn_prepared_r dataset.

  2. In the Actions panel, select a new R recipe.

  3. Under Outputs, create two output datasets, train_r and test_r.

  4. Click Create Recipe.

  5. In the recipe editor, click Save.

Dataiku screenshot of the dialog for creating an R recipe from the Flow.

Download an R recipe to RStudio#

Now that you have created the recipe, let’s edit it in RStudio, and save the new version back to the Dataiku instance.

Important

If you followed the setup in the section above, no additional configuration is needed. Alternatively, you can skip this step and edit the R recipe directly within Dataiku.

  1. Within RStudio, create a new R script.

  2. From the Addins menu, select Dataiku: download R recipe code.

  3. Choose the project key, DKU_TUT_R_USERS.

  4. Choose the recipe ID, compute_train_r.

  5. Click Download.

    Dialog window from RStudio addin asking user which recipe from which project key to download.

    The previously empty R script should now be filled with the same R code found on the Dataiku instance. Let’s edit it to mimic the action of the visual Split recipe.

  6. Replace the existing R script with the new code below.

    library(dataiku)
    library(dplyr)
    
    # Recipe inputs
    churn_prepared_r <- dkuReadDataset("churn_prepared_r", samplingMethod="head", nbRows=100000)
    
    # Data preparation
    # Draw one uniform random number per row to drive the split
    churn_prepared_r %>%
        mutate(splitter = runif(n())) ->
        df_to_split
    
    # Compute recipe outputs
    train_r <- subset(df_to_split, df_to_split$splitter <= 0.7)
    test_r <- subset(df_to_split, df_to_split$splitter > 0.7)
    
    # Recipe outputs
    dkuWriteDataset(train_r,"train_r")
    dkuWriteDataset(test_r,"test_r")
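Because the split relies on random numbers, the train and test shares are only approximately 70% and 30%, and they change on every run. The sketch below, in base R on a hypothetical data frame, shows the same thresholding logic with a fixed seed. The seed value and the toy data are assumptions added for illustration and are not part of the tutorial recipe.

```r
set.seed(42)  # hypothetical seed, only to make this sketch repeatable

n <- 1000
toy <- data.frame(id = seq_len(n))

# One uniform random number per row
toy$splitter <- runif(n)

# Rows at or below the threshold go to train, the rest to test
train <- subset(toy, splitter <= 0.7)
test <- subset(toy, splitter > 0.7)

cat("train share:", nrow(train) / n, "\n")
cat("test share:", nrow(test) / n, "\n")
```

Every row lands in exactly one of the two outputs, since the two conditions are complementary.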
    

Save an R recipe back to Dataiku#

Now, let’s save it back to the Dataiku instance.

  1. From the Addins menu of RStudio, select Dataiku: save R recipe code.

  2. After ensuring the correct project key and recipe ID are selected, click Send to DSS.

  3. Return to the Dataiku instance, and confirm that the new recipe has been updated after refreshing the page.

  4. From the recipe editor, click Run to build both output datasets.

Note

One limitation of using the Dataiku R API outside Dataiku concerns writing datasets: you cannot write from RStudio to a Dataiku dataset, as explained in How-to | Edit Dataiku recipes in RStudio.

Code with project variables#

We now have train and test sets ready for modeling, but first let’s demonstrate how project variables can be useful in a data pipeline such as this.

In the modeling stage ahead, it will be convenient to have our target variable, numeric variables, and character variables stored as separate vectors. It could be helpful to save these vectors as project variables instead of copying and pasting them for the forthcoming training and scoring recipes.

  1. From the top navigation bar, go to the Code menu, and open the Notebooks page (g + n).

  2. Open the code notebook for the churn_copy dataset.

  3. Add the code snippet below to the end of the notebook in new cells. Walk through it line by line to understand how this section gets and sets project variables using the functions dkuGetProjectVariables() and dkuSetProjectVariables() from the R API.

    # Empty any existing project variables
    var <- dkuGetProjectVariables()
    var$standard <- list(standard=NULL)
    dkuSetProjectVariables(var)
    
    # Define target, categoric, and numeric variables
    target_var <- "Churn"
    categoric_vars <- names(df_prepared_r)[sapply(df_prepared_r, is.character)]
    categoric_vars <- categoric_vars[!categoric_vars %in% c("Churn")]
    numeric_vars <- names(df_prepared_r)[sapply(df_prepared_r, is.numeric)]
    
    # Get and set project variables
    var <- dkuGetProjectVariables()
    
    var$standard$target_var <- target_var
    var$standard$categoric_vars <- categoric_vars
    var$standard$numeric_vars <- numeric_vars
    
    dkuSetProjectVariables(var)
    
  4. After running this code, navigate to More Options (...) > Variables from the top navigation bar.

    You should see three global variables — meaning these variables are accessible anywhere in the project.

Dataiku screenshot of the project variables page with 3 variables.

Tip

Try opening a new R notebook and running vars <- dkuGetProjectVariables() to confirm that these variables are now accessible anywhere in the project as an R list.
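The column-selection logic in the snippet above can be tried outside Dataiku. The sketch below applies it to a hypothetical toy data frame and then mimics the retrieved variables with a plain nested list. The shape of that list is an assumption for illustration (based on the later recipes calling unlist() on the stored vectors), not a dump of what the API returns.

```r
# Hypothetical prepared data with character and numeric columns
toy <- data.frame(
  State = c("KS", "OH"),
  Intl_Plan = c("no", "yes"),
  Day_Mins = c(265.1, 161.6),
  CustServ_Calls = c(1L, 1L),
  Churn = c("False", "True"),
  stringsAsFactors = FALSE
)

# Same selection logic as in the tutorial snippet
target_var <- "Churn"
categoric_vars <- names(toy)[sapply(toy, is.character)]
categoric_vars <- categoric_vars[!categoric_vars %in% c("Churn")]
numeric_vars <- names(toy)[sapply(toy, is.numeric)]

# Assumed stand-in for the retrieved project variables:
# vectors stored as variables may come back as lists
vars <- list(standard = list(
  target_var = target_var,
  categoric_vars = as.list(categoric_vars),
  numeric_vars = as.list(numeric_vars)
))

# unlist() turns such a list back into a character vector
features.cat <- unlist(vars$standard$categoric_vars)
print(features.cat)
```

Integer columns such as CustServ_Calls count as numeric here, because is.numeric() is TRUE for both integers and doubles.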

Execute a machine learning workflow#

Review the ML workflow#

Now that we have prepared and split the dataset, we are ready to begin modeling.

The green icons in the Flow represent the machine learning portion of the Flow. To summarize the visual ML workflow:

  • Train a model in the Lab.

  • Deploy the best version to the Flow as a saved model.

  • Send test data and the saved model to the Score recipe in order to produce output predictions.

The same workflow can be achieved with R recipes:

  • Write an R recipe that trains a model.

  • Output the model to a managed folder.

  • Write another R recipe to score the test data using the model in the folder.

To do this, you’ll need to be able to interact with managed folders through the R API.

Create an R recipe with a managed folder output#

Given that we have the necessary packages available in the project code environment, we are ready to create an R recipe that trains a model. Unlike previous recipes however, the output of this recipe will be a managed folder instead of a dataset.

We can store any kind of file (supported or unsupported) in a managed folder, and use the Dataiku R API to interact with the files stored inside.

  1. From the Flow, select the train_r dataset.

  2. In the Actions panel, select an R recipe.

  3. Under Outputs, click + Add and switch to New Folder.

  4. Name the output folder model_r.

  5. Click Create Folder.

  6. Click Create Recipe.

Dataiku screenshot of the dialog to create an R recipe with a folder output.

Import code from a Git repository#

We now have the correct code environment, input, and output to build our model. Let’s start coding!

Imagine, however, that we want to reuse some code already developed outside of Dataiku. Perhaps we want to reuse the same parameters or hyperparameter settings found in models elsewhere.

Let’s import code from a Git repository so that it can be used in the current recipe.

Important

If you’re unable to import the code from the Git repository, feel free to just copy-paste it into the recipe instead.

  1. From the Code menu of the top navigation bar, select Libraries (g + l).

  2. Click Git > Import from Git.

  3. In the dialog window, supply the HTTPS link for this academy-samples GitHub repository (found by clicking on the Code button and then the clipboard).

  4. Check out the default main branch.

  5. Add /r-users/ as the path in the repository.

  6. Add /R/ as the target path of the project library.

  7. Uncheck Add to Python path.

  8. Click Save and Retrieve.

  9. Click OK to confirm the creation of the Git reference has succeeded.

Dataiku screenshot of the dialog for creating a Git reference.

Let’s recap what this achieved:

  • The same file train_settings.R found in the GitHub repository is now also in the project library. It can be used in this project (or potentially in other Dataiku projects as well, by editing the importLibrariesFromProjects field of the external-libraries.json file).

  • The file external-libraries.json now records the Git reference. Open it to view the reference.

Note

The reference documentation provides more details on reusing R code.

Train a model with an R recipe#

Once the Git reference is created, we can import the contents of a file found in the project library with the function dkuSourceLibR().

  1. Return to the R recipe that outputs the model_r folder.

  2. Replace the existing recipe with the code snippet below, taking note of the following:

    • The gbm and caret packages can be used because of the project-level code environment.

    • dkuSourceLibR() imports the objects fit.control and gbm.grid found in the train_settings.R file.

    • dkuGetProjectVariables() retrieves the project variables set earlier.

    library(dataiku)
    library(gbm)
    library(caret)
    
    # Import from project library
    dkuSourceLibR("train_settings.R")
    
    # Recipe inputs
    df <- dkuReadDataset("train_r")
    
    # Call project variables
    vars <- dkuGetProjectVariables()
    target.variable <- vars$standard$target_var
    features.cat <- unlist(vars$standard$categoric_vars)
    features.num <- unlist(vars$standard$numeric_vars)
    
    # Preprocessing
    df[features.cat]    <- lapply(df[features.cat], as.factor)
    df[features.num]    <- lapply(df[features.num], as.double)
    df[target.variable] <- lapply(df[target.variable], as.factor)
    train.ml <- df[c(features.cat, features.num, target.variable)]
    
    # Training (fit.control and gbm.grid found in train_settings.R)
    gbm.fit <- train(
        Churn ~ .,
        data = train.ml,
        method = "gbm",
        trControl = fit.control,
        tuneGrid = gbm.grid,
        metric = "ROC",
        verbose = FALSE
    )
    
    # Recipe outputs
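    model_r <- dkuManagedFolderPath("model_r")
    # Clear previous (non-hidden) contents without changing the working directory
    unlink(list.files(model_r, full.names = TRUE), recursive = TRUE)
    path <- file.path(model_r, "model.RData")
    save(gbm.fit, file = path)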
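The preprocessing lines in the recipe above rely on a base R idiom: selecting several columns by name and assigning the result of lapply() back to them. A minimal sketch on a hypothetical toy data frame (values invented, and a single column per feature group for brevity) shows the effect on column types.

```r
# Hypothetical training data
toy <- data.frame(
  Intl_Plan = c("no", "yes", "no"),
  CustServ_Calls = c(1L, 4L, 2L),
  Churn = c("False", "True", "False"),
  stringsAsFactors = FALSE
)

# Stand-ins for the values read from project variables
features.cat <- "Intl_Plan"
features.num <- "CustServ_Calls"
target.variable <- "Churn"

# Convert each group of columns in place
toy[features.cat] <- lapply(toy[features.cat], as.factor)
toy[features.num] <- lapply(toy[features.num], as.double)
toy[target.variable] <- lapply(toy[target.variable], as.factor)

col_classes <- sapply(toy, class)
print(col_classes)
```

Converting the target to a factor matters because train() treats a factor outcome as a classification problem.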
    

Write an R recipe output to a folder#

There’s one problem before we can run this recipe.

This code uses the default dkuManagedFolderPath() to retrieve the file path used to write the model to the folder output. However, this function works only for local folders.

If the data were hosted somewhere other than the local filesystem, or the code were not running on the Dataiku machine, this code would fail.

Let’s modify this recipe to work for a local or non-local folder using dkuManagedFolderUploadPath() instead of dkuManagedFolderPath().

  1. Replace the recipe outputs section with the code below.

    # Recipe outputs (local or non-local folder)
    save(gbm.fit, file= "model.RData")
    connection <- file("model.RData", "rb")
    dkuManagedFolderUploadPath("model_r", "model.RData", connection)
    close(connection)
    
  2. Click Run to train the model and save it in the folder.

  3. When it’s finished, open the folder, and confirm it holds the model.RData file.
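The upload approach works because the file written by save() is just a sequence of bytes that can be moved anywhere and restored later. The sketch below illustrates that round trip in base R without any Dataiku calls. The lm() model is only a stand-in for gbm.fit, and the temporary files stand in for the managed folder.

```r
# Stand-in for the trained model: any R object is saved the same way
gbm.fit <- lm(mpg ~ wt, data = mtcars)

# Save to a local file, then read the file back as raw bytes
local_file <- tempfile(fileext = ".RData")
save(gbm.fit, file = local_file)
bytes <- readBin(local_file, what = "raw", n = file.info(local_file)$size)

# Simulate a later session: remove the object, then restore it from the bytes
rm(gbm.fit)
restored_file <- tempfile(fileext = ".RData")
writeBin(bytes, restored_file)
restored_names <- load(restored_file)

print(restored_names)
```

Note that load() restores objects under the names they were saved with and returns those names, so the model reappears under the name gbm.fit in whichever session loads the file.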

Score test data with an R recipe#

There’s one last step to complete this Flow!

Now that we have a trained model in a managed folder, we can use it to score the testing data with another R recipe.

  1. From the Flow, select the model_r folder.

  2. In the Actions panel, select an R recipe.

  3. Under Inputs, click + Add, and select the test_r dataset.

  4. Under Outputs, click + Add. Name the new output dataset test_scored_r.

  5. Click Create Dataset.

  6. Click Create Recipe.

Dataiku screenshot of a dialog to create an R recipe.

Once we have the correct inputs and output, we just need to supply the code.

  1. Replace the default code with the snippet below.

    library(dataiku)
    library(gbm)
    library(caret)
    
    # Load R model (local or non-local folder)
    data <- dkuManagedFolderDownloadPath("model_r", "model.RData")
    load(rawConnection(data))
    
    # Load R model (local folder only)
    # model_r <- dkuManagedFolderPath("model_r")
    # path <- paste(model_r, 'model.RData', sep="/")
    # load(path)
    
    # Confirm model loaded
    print(gbm.fit)
    
    # Recipe inputs
    df <- dkuReadDataset("test_r")
    
    # Call project variables
    vars <- dkuGetProjectVariables()
    target.variable <- vars$standard$target_var
    features.cat <- unlist(vars$standard$categoric_vars)
    features.num <- unlist(vars$standard$numeric_vars)
    
    # Preprocessing
    df[features.cat]    <- lapply(df[features.cat], as.factor)
    df[features.num]    <- lapply(df[features.num], as.double)
    df[target.variable] <- lapply(df[target.variable], as.factor)
    test.ml <- df[c(features.cat, features.num, target.variable)]
    
    # Prediction
    o <- cbind(df, predict(gbm.fit, test.ml,
                        type = "prob",
                        na.action = na.pass))
    
    # Recipe outputs
    dkuWriteDataset(o, "test_scored_r")
    
  2. Click Run to execute the recipe.

Note

In addition to standard R code, note how the code above uses the Dataiku R API:

  • dkuManagedFolderDownloadPath() interacts with the contents of a (local or non-local) managed folder. The strictly local alternative using dkuManagedFolderPath() is also provided for demonstration in comments.

  • dkuReadDataset() and dkuWriteDataset() handle reading and writing of dataset inputs and outputs.

  • dkuGetProjectVariables() retrieves the values of project variables.

What’s next?#

Congratulations! You’ve built an ML pipeline in Dataiku entirely with R. You’ve also demonstrated how this could be done within Dataiku or from an external IDE such as RStudio.

See also

See the reference documentation for more information about DSS and R.