Using SparkR in Dataiku

This short article shows you how to use Dataiku with SparkR, a Spark module built to interact with DataFrames through R syntax.

Prerequisites

You need access to an instance of DSS with Spark enabled and a working installation of Spark, version 1.4 or later.

We’ll use the Titanic dataset (here, the version from a Kaggle contest). Before starting, create a new DSS dataset from it and make sure it is parsed into a format suitable for analysis.

Using SparkR interactively in Jupyter Notebooks

The best way to explore both your dataset and the SparkR API interactively is to use a Jupyter Notebook. From the top navigation bar of DSS, click on Notebook, then select R, pre-filled with “Template: Starter code for processing with SparkR”:

"Creating a new R notebook from starter SparkR code"

A notebook opens. Using the template code, you can quickly load your DSS dataset into a SparkR DataFrame:

library(SparkR)
library(dataiku)
library(dataiku.spark)

# Initialize SparkR
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# Load the DSS dataset into a Spark DataFrame
titanic <- dkuSparkReadDataset(sqlContext, "titanic")

Now that your DataFrame is loaded, you can start using the SparkR API to explore it. Like the PySpark API, SparkR provides several useful functions:

# How many records in the dataframe?
nrow(titanic)

# What's the exact schema of the dataframe?
schema(titanic)

# What's the content of the dataframe?
head(titanic)

"Schema and head of the Titanic dataframe"

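Beyond these three calls, a few more DataFrame operations are handy for a first look. The following is a minimal sketch, assuming `titanic` is the DataFrame loaded by the template code above and that the dataset has the `PassengerId`, `Survived`, and `Fare` columns used elsewhere in this article. The fare threshold of 50 is an arbitrary value chosen for illustration.

```r
library(SparkR)

# Assumption: `titanic` was loaded with dkuSparkReadDataset() as shown above

# Print the schema as an indented tree
printSchema(titanic)

# Keep only the passengers who paid more than 50 (arbitrary threshold)
expensive <- filter(titanic, titanic$Fare > 50)

# Look at a few columns of the filtered DataFrame
head(select(expensive, "PassengerId", "Survived", "Fare"))

# Count the matching rows
count(expensive)
```

These transformations are lazy: Spark only runs the computation when an action such as head() or count() asks for a result.
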
SparkR also has functions to compute aggregates:

head(
  summarize(
    groupBy(titanic, titanic$Survived),
    counts = n(titanic$PassengerId),
    fares = avg(titanic$Fare)
  )
)

"Count of passengers and average fare paid, aggregated by survival"

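The same aggregate can also be written in SQL. This is a sketch under the same assumptions as the rest of this article (`sqlContext` and `titanic` come from the template code, Spark 1.x API). The temporary table name `titanic_tmp` is a made-up name for this example.

```r
library(SparkR)

# Assumption: `sqlContext` and `titanic` were created by the template code above

# Expose the DataFrame to Spark SQL under a temporary, illustrative name
registerTempTable(titanic, "titanic_tmp")

# Same grouping as above: passenger count and average fare per Survived value
agg_sql <- sql(sqlContext, "
  SELECT Survived,
         COUNT(PassengerId) AS counts,
         AVG(Fare)          AS fares
  FROM titanic_tmp
  GROUP BY Survived
")

head(agg_sql)
```

Which form to use is mostly a matter of taste: the function-based version composes well with other R code, while the SQL version may be easier to read for people coming from a database background.
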
Check the official SparkR documentation regularly to keep up with the latest improvements to the API.

Integrating SparkR recipes in your workflow

Assuming you are ready to deploy your SparkR script, let’s switch to the Flow screen and create a new SparkR recipe:

"Creating a new SparkR recipe"

Specify the recipe inputs and outputs, then paste your R code into the code editor:

library(SparkR)
library(dataiku)
library(dataiku.spark)

# Initialize SparkR
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# Read input datasets
titanic <- dkuSparkReadDataset(sqlContext, "titanic")

# Aggregation
agg <- summarize(
         groupBy(titanic, titanic$Survived),
         counts = n(titanic$PassengerId),
         fares = avg(titanic$Fare)
       )

# Output datasets
dkuSparkWriteDataset(agg, "titanicr")

Your recipe is now ready. Just click the Run button and wait for your job to complete:

"Completed SparkR recipe"

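Once the job has completed, one way to check the result is to read the output dataset back from a notebook. This is a minimal sketch, assuming the notebook already has the `sqlContext` from the template code and that the recipe wrote to the `titanicr` dataset as above:

```r
library(SparkR)
library(dataiku)
library(dataiku.spark)

# Assumption: `sqlContext` was initialized as in the template code

# Read the recipe output back as a SparkR DataFrame
result <- dkuSparkReadDataset(sqlContext, "titanicr")

# The aggregate has one row per Survived value, so it is small enough
# to bring into the R session as a regular data.frame
local_result <- collect(result)
print(local_result)
```

Calling collect() is only reasonable here because the aggregated output is tiny; on a large dataset, head() is the safer way to peek at the data.
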
That’s it for this short intro! Because SparkR is integrated into DSS, you can develop and manage entirely Spark-based workflows in the language of your choice.