Tutorial | SparkR in Dataiku#
Introduction#
This short article shows you how to use Dataiku with SparkR, the Spark module that lets you manipulate Spark DataFrames using R syntax.
Prerequisites#
You have access to a Dataiku instance with Spark enabled, and a working Spark installation, version 1.4 or later (the first release to ship with SparkR).
We’ll be using the Titanic dataset (here taken from a Kaggle contest), so first create a new Dataiku dataset from it and parse it into a format suitable for analysis.
Using SparkR interactively in Jupyter notebooks#
The best way to discover both your dataset and the SparkR API interactively is a Jupyter notebook. From the top navigation bar of Dataiku, click on Notebook and select R, choosing the “Starter code for processing with SparkR” template.
A notebook opens, pre-filled with the template code. Leveraging it, you can quickly load your Dataiku dataset into a SparkR DataFrame:
library(SparkR)
library(dataiku)
library(dataiku.spark)
# Initialize SparkR
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
# Load the DSS dataset into a Spark DataFrame
titanic <- dkuSparkReadDataset(sqlContext, "titanic")
Now that your DataFrame is loaded, you can start exploring it with the SparkR API. As in PySpark, SparkR provides some useful inspection functions:
# How many records in the dataframe?
nrow(titanic)
# What's the exact schema of the dataframe?
schema(titanic)
# What's the content of the dataframe?
head(titanic)
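You can also filter rows and project columns before pulling results back to the driver. Below is a minimal sketch, assuming the standard Kaggle Titanic column names (Name, Age, Fare) and the Spark 1.x SparkR API:
# Keep only survivors, then select a few columns
# (a sketch; assumes the standard Kaggle Titanic column names)
survivors <- filter(titanic, titanic$Survived == 1)
head(select(survivors, "Name", "Age", "Fare"))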
Also, SparkR has functions to create aggregates:
head(
  summarize(
    groupBy(titanic, titanic$Survived),
    counts = n(titanic$PassengerId),
    fares = avg(titanic$Fare)
  )
)
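For readability, you can also sort aggregates before inspecting them. A minimal sketch, assuming the same aggregate as above and SparkR’s arrange() and desc() helpers:
# Sort the aggregates by descending count before peeking at them
agg <- summarize(
  groupBy(titanic, titanic$Survived),
  counts = n(titanic$PassengerId),
  fares = avg(titanic$Fare)
)
head(arrange(agg, desc(agg$counts)))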
Of course, make sure to check the official documentation regularly to stay current with the latest improvements to the SparkR API.
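One last note before moving to the Flow: the sqlContext created by the template also lets you query the same DataFrame in plain SQL. A minimal sketch, assuming the Spark 1.x registerTempTable() and sql() calls, a hypothetical table name titanic_tbl, and a numeric Survived column:
# Register the DataFrame as a temporary table, then query it with SQL
registerTempTable(titanic, "titanic_tbl")
rates <- sql(sqlContext, "SELECT Pclass, AVG(Survived) AS survival_rate
                          FROM titanic_tbl GROUP BY Pclass")
head(rates)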
Integrating SparkR recipes in your workflow#
Assuming your SparkR script is ready to deploy, switch to the Flow screen and create a new SparkR recipe:
Specify the recipe’s inputs and outputs, then copy and paste your R code into the code editor:
library(SparkR)
library(dataiku)
library(dataiku.spark)
# Initialize SparkR
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
# Read input datasets
titanic <- dkuSparkReadDataset(sqlContext, "titanic")
# Aggregation
agg <- summarize(
  groupBy(titanic, titanic$Survived),
  counts = n(titanic$PassengerId),
  fares = avg(titanic$Fare)
)
# Output datasets
dkuSparkWriteDataset(agg, "titanicr")
Your recipe is now ready. Just click the Run button and wait for your job to complete:
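To sanity-check the result, you can read the output dataset back into a regular R data.frame from an R notebook. A minimal sketch, assuming the dataiku package’s dkuReadDataset() and the titanicr output created above:
library(dataiku)

# Read the recipe output back as a plain R data.frame for a quick check
agg_df <- dkuReadDataset("titanicr")
print(agg_df)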
That’s it for this short introduction! With SparkR part of Dataiku, you can now develop and manage fully Spark-based workflows in the language of your choice.