Concept | Where computation happens in Dataiku#

In the Dataiku Flow, you take datasets from different sources (SQL, filesystem, HDFS) and seamlessly apply recipes (like SQL queries, preparation scripts, or computing predictions from a model). But what happens under the hood? Where does the computation actually take place?

In this article, we will briefly present the architecture behind Dataiku and explain why it matters.

Overview#

Computation in Dataiku can take four main forms:

  • in-memory

  • streamed

  • in-database (SQL)

  • in-cluster (Hadoop)

Which strategy is used depends on the dataset and on the operation you are applying to it.

Why does it matter? If you apply the wrong type of recipe to your dataset, your computation may be painfully slow or run out of memory! These issues particularly arise with big data.

Here is a table summarizing where the computation takes place for each operation performed in Dataiku:

| Operation | Locally in Dataiku | In Hadoop / Spark | In SQL Database | In Kubernetes / Docker |
|---|---|---|---|---|
| Visual Preparation Design | In-memory (on a sample) | N/A | N/A | N/A |
| Visual Preparation Execution | YES (streaming) | YES (Spark) | YES | N/A |
| Visual Recipes (other than Prepare) | YES (streaming or disk-copy) | YES (Hive, Spark, Impala) | YES | N/A |
| Python and R recipes | YES (in-memory or streaming) | YES (PySpark, SparkR, sparklyr) | Custom code with DSS helper API | YES (in-memory or streaming) |
| Spark-Scala recipe | N/A | YES | N/A | N/A |
| Charts | YES | YES (Hive, Impala; most charts) | YES (most charts) | N/A |
| Machine Learning train | YES (scikit-learn, XGBoost, Keras/Tensorflow) | YES (MLlib) | N/A | YES (scikit-learn, XGBoost, Keras/Tensorflow) |
| Machine Learning execution | YES (scikit-learn, XGBoost, MLlib, Keras/Tensorflow) | YES (scikit-learn, XGBoost, MLlib) | YES (scikit-learn partial, XGBoost, some MLlib models) | YES (scikit-learn, XGBoost, MLlib, Keras/Tensorflow) |
| Python, R, Scala notebooks | YES (in-memory or streaming) | YES (Spark-Scala, PySpark, SparkR, sparklyr) | Custom code with Dataiku helper API | YES |
| SQL-like recipe or notebook | N/A | YES (Hive, Impala, SparkSQL) | YES | N/A |

Some rules are clear and easy to understand. Take a SQL recipe: remember first that its input datasets must be stored in a SQL database. When you hit run on a SQL recipe, Dataiku sends a query to the SQL database, which reads the input datasets and performs the SQL query; Dataiku then writes the output dataset in the database if it is a SQL dataset, or streams the output back otherwise.

Diagram showing how the computation for a join of two data sources in an SQL database takes place in-database.

../../_images/sql_recipe1.png

So the computation actually runs in-database. The idea is the same for Hive recipes, except that it applies to HDFS datasets, and everything runs in-cluster.

Diagram showing how the computation for data sources in a Hadoop cluster takes place in-cluster.

../../_images/dss_hadoop.png
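
The "Custom code with DSS helper API" entries in the table above follow the same push-down idea from code: you send a query to the database yourself, and only the result comes back to Dataiku. Here is a minimal Python sketch of that pattern, assuming a hypothetical SQL dataset named orders backed by a table of the same name (the exact parameters of SQLExecutor2 are described in the Dataiku API documentation):

    import dataiku
    from dataiku import SQLExecutor2

    # Hypothetical SQL dataset; the FROM clause must reference the actual
    # table that backs the dataset in the database.
    orders = dataiku.Dataset("orders")
    executor = SQLExecutor2(dataset=orders)  # reuse the dataset's SQL connection

    # The aggregation runs in the database; only the aggregated result
    # is streamed back to Dataiku as a pandas DataFrame.
    df = executor.query_to_df("""
        SELECT customer_id, SUM(amount) AS total_amount
        FROM orders
        GROUP BY customer_id
    """)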

Dataiku does not solely act as a server sending queries to a SQL database or a Hadoop cluster. It also performs computation itself, through what we call the DSS engine, which uses two main strategies:

  • In-memory: the data is stored in RAM.

  • Streamed: Dataiku reads the input dataset as a stream of rows, applies computations to the rows as they arrive, and writes the output dataset row by row (see the sketch below).
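
As an illustration of the streamed strategy, here is a minimal sketch of a Python recipe that reads and writes row by row with the Dataiku API, so the full dataset never has to fit in RAM (dataset and column names are hypothetical):

    import dataiku

    input_ds = dataiku.Dataset("transactions")         # hypothetical input
    output_ds = dataiku.Dataset("transactions_clean")  # hypothetical output

    # Declare the output schema before streaming rows into it.
    output_ds.write_schema([
        {"name": "transaction_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ])

    with output_ds.get_writer() as writer:
        # iter_rows() yields the input one row at a time.
        for row in input_ds.iter_rows():
            writer.write_row_dict({
                "transaction_id": row["transaction_id"],
                "amount": float(row["amount"]),
            })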

Note

The DSS engine also uses a third strategy for charts, but this is a particular topic that is well explained in the documentation.

In-memory processing#

By nature, in-memory computation can lead to out-of-memory situations on large datasets.

When can you actually run into this in Dataiku? As shown in the table above, in-memory computation happens in the following situations:

  • Python and R recipes

  • Editing a preparation recipe (or analysis)

  • Training a machine learning model

Visual data preparation#

When you are editing a preparation script (a Prepare recipe), Dataiku actually samples the dataset to ensure that the design computations will fit in RAM.

In a preparation script, you can add any processor and the editor stays fast and responsive, no matter the size of the original dataset, because the computation is done only on a small sample.

Prepare recipe with Design Sample panel open.

Diagram showing how the data source is sampled for use in a Preparation recipe.

Note that you can change the sampling method, and you could even use the whole dataset. But think twice before switching the sample to the whole dataset when the dataset is large.

First of all, it defeats the purpose of having a fast and responsive editing mode, and you may even run out of memory!

Warning

In a preparation script, do not use a design sample that is too large.

Data preparation execution#

You may have noticed that the strategy used to execute a preparation script (when you click run) is completely different from the one used when editing the script.

Diagram showing how Dataiku determines where preparation processing is done (streamed or in-cluster).

../../_images/dss_cluster_or_engine.png

Dataiku will choose the best strategy available (streamed or in-cluster) to execute the script, but never in-memory: execution applies the computation to the whole dataset, which is why in-memory processing is not used.

Python and R recipes#

When you are using the pandas Python package, or R, keep in mind that both require in-memory data manipulation. As such, R and pandas are not well suited to manipulating large datasets: if the input or output data cannot fit in RAM, you may run out of memory.

Diagram showing how R and Python processing is done in-memory.

../../_images/python_recipe1.png
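
For reference, this is what the typical in-memory pattern looks like in a Python recipe: get_dataframe() loads the whole input dataset into a pandas DataFrame in RAM before anything else happens (dataset and column names below are hypothetical):

    import dataiku

    # The entire input dataset is loaded into RAM as a pandas DataFrame.
    customers = dataiku.Dataset("customers").get_dataframe()

    # Any pandas manipulation then happens fully in memory.
    customers["full_name"] = customers["first_name"] + " " + customers["last_name"]

    # The output is written back in one go.
    dataiku.Dataset("customers_prepared").write_with_schema(customers)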

Warning

You cannot apply R recipes, or Python recipes that use pandas, to datasets that are too big to fit in RAM.

In Python, several workarounds are available:

  • Use streamed (per-row) processing instead of DataFrame processing

  • Use chunked processing (read DataFrames by blocks), as sketched below
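
Here is a minimal sketch of the chunked approach, assuming hypothetical dataset and column names; iter_dataframes() yields the input as successive pandas DataFrames, so only one chunk has to fit in RAM at a time:

    import dataiku

    input_ds = dataiku.Dataset("transactions")       # hypothetical input
    output_ds = dataiku.Dataset("transactions_eur")  # hypothetical output

    writer = None
    try:
        for chunk in input_ds.iter_dataframes(chunksize=100000):
            # Process one chunk at a time (hypothetical columns).
            chunk["amount_eur"] = chunk["amount_usd"] * 0.92

            if writer is None:
                # Set the output schema from the first processed chunk.
                output_ds.write_schema_from_dataframe(chunk)
                writer = output_ds.get_writer()
            writer.write_dataframe(chunk)
    finally:
        if writer is not None:
            writer.close()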

Note

The reference documentation has details on these implementations in Python recipes.

How to deal with big data?#

Python with pandas, or R, can still be useful when working with big data.

For example, you can sometimes limit yourself to a sample of the dataset; it's a good way to prototype your code. Dataiku allows you to stream a random sample of the data and import it directly as a pandas DataFrame (by passing the sampling and limit arguments to the get_dataframe() function) or as an R dataframe (by passing the sampling argument to the read.dataset() function).
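
In a Python recipe or notebook, this could look like the following (the dataset name is hypothetical, and the available sampling options are detailed in the Dataiku API documentation):

    import dataiku

    logs = dataiku.Dataset("web_logs")  # hypothetical dataset

    # Stream a random sample (~5% of the rows) into a pandas DataFrame.
    sample_df = logs.get_dataframe(sampling="random", ratio=0.05)

    # Or simply cap the number of rows read from the top of the dataset.
    head_df = logs.get_dataframe(sampling="head", limit=100000)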

But afterwards, if you want to apply the computation to the whole dataset, you will have to switch to another computing strategy.

Dataiku offers many alternatives:

  • As mentioned, you could use a streamed implementation of your Python code.

  • A Python UDF processor in a preparation recipe could work (see the sketch after this list).

  • In some cases, using a SQL recipe may be a very good option.
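
As an example of the second option, the Python function processor of a Prepare recipe applies a user-defined function to rows as they are streamed, so the whole dataset never has to be loaded at once. A minimal sketch in "row" mode, with hypothetical column names (the available processor modes are described in the documentation):

    # Python function processor, "row" mode: the function receives one row
    # as a dict and returns the (possibly modified) row.
    def process(row):
        # Hypothetical columns; values arrive as strings in a Prepare recipe.
        row["amount_eur"] = float(row["amount_usd"]) * 0.92
        return row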

However, when dealing with big data, you want to push the computation to the data as much as possible. Computing engines like Hadoop on HDFS follow this principle and are specifically designed for big data. Hive recipes, which run in-cluster, are therefore well suited to big data computing.

Dataiku allows you to easily switch between computing engines. For instance, suppose your current dataset is on the filesystem, but is too large for a Python recipe. You can use a Sync recipe to push your dataset to HDFS, and opt for a Hive or PySpark recipe instead.