Tip | Using Spark

Dataiku Cloud allows you to leverage Spark on Kubernetes (k8s) for distributed execution of heavy data wrangling jobs that are not SQL-compatible (e.g. some Prepare recipe processors).

When you choose Spark as the compute engine, you can select the Spark config in the Advanced tab of visual recipes. Each config is defined by a number of workers of a given size (CPUs and RAM).

Dataiku screenshot of Spark configuration.


As with containers, we recommend starting with the smallest Spark config as a test. Note that every worker spins up a separate container. For example, the smallest Spark config, spark-XS-2-workers-of-1-CPU-3Gb-Ram, starts two containers with 1 CPU and 3 GB of RAM each, so it consumes a total of 2 CPUs and 6 GB of RAM from your quota (but counts as only 1 parallel activity).
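
The quota arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper `spark_footprint` and the `SparkFootprint` type are hypothetical names, not part of any Dataiku API, and the name pattern is inferred from the single config name mentioned in this tip, so other configs may not follow it.

```python
import re
from typing import NamedTuple


class SparkFootprint(NamedTuple):
    """Total resources a Spark config would draw from the quota."""
    workers: int
    total_cpus: int
    total_ram_gb: int


# Assumption: config names look like "...-<N>-workers-of-<C>-CPU-<R>Gb-Ram".
# This pattern is inferred from one example name, not from an official spec.
_CONFIG_PATTERN = re.compile(
    r"(\d+)-workers-of-(\d+)-CPU-(\d+)Gb-Ram", re.IGNORECASE
)


def spark_footprint(config_name: str) -> SparkFootprint:
    """Estimate total CPUs and RAM, given one container per worker."""
    match = _CONFIG_PATTERN.search(config_name)
    if match is None:
        raise ValueError(f"Unrecognized Spark config name: {config_name!r}")
    workers, cpus_per_worker, ram_per_worker = (int(g) for g in match.groups())
    return SparkFootprint(
        workers=workers,
        total_cpus=workers * cpus_per_worker,
        total_ram_gb=workers * ram_per_worker,
    )


print(spark_footprint("spark-XS-2-workers-of-1-CPU-3Gb-Ram"))
# SparkFootprint(workers=2, total_cpus=2, total_ram_gb=6)
```

Running a check like this before picking a larger config is a quick way to see how much of your CPU and RAM quota a job would consume.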