Concept | Computation engines#

Watch the video

Computation has a cost. In Dataiku, there are computation strategies that help with reducing this cost. Dataiku can perform the computation using the DSS engine or push down the computation to external engines. Dataiku acts as an orchestrator of your connections’ engines, delegating the computation to these connections when possible.

Computation delegation in Dataiku.

See also

For more information, see the Execution engines article in the reference documentation.

Sampling#

When transforming your dataset, you are actually working with a sample of it. Working with a sample allows Dataiku to provide fast and responsive visual feedback on your preparation steps, no matter the size of the whole dataset. When you are ready to apply transformation steps to the whole dataset, click Run. Dataiku selects a computation engine that best fits the underlying data storage and the operation you are applying to the dataset.

Screenshot of the computation engine selection screen.

Recipe engines#

Computation in Dataiku can take four main forms, described in the table below.

Computation engine

Description

In DSS engine

For in-memory execution, the data are stored in RAM. This strategy is used, for example, to execute Python or R recipes.

For streamed execution, Dataiku reads the input dataset as a stream of rows, applies computation to the rows as they arrive, and writes the output datasets row per row.

In database

This strategy is used, for example, to execute SQL queries. The visual recipe is translated into an SQL query, which is then injected, or pushed down, to the SQL server.

On Hadoop / Spark cluster

Depending on the engine you choose, the visual recipe is translated into a Hive, Impala, or Spark SQL query, which is then injected, or pushed down, to the Hadoop or Spark cluster.

In Kubernetes / Docker

This strategy is also in-memory or streamed but uses container execution through Docker and Kubernetes clusters rather than the Dataiku host server.

To avoid running out of memory when manipulating large datasets, Dataiku recommends offloading the computation to where the data lives. This way you avoid bringing all the data into memory or streaming it through the Dataiku server.

What’s next?#

In this article, you learned about ways to optimize computation when working in Dataiku. Continue getting to know the basics of Dataiku by learning about jobs.