Tip | Using Spark

Dataiku Cloud lets you leverage Spark on Kubernetes (k8s) for the distributed execution of heavy data wrangling jobs that are not SQL-compatible (e.g. some Prepare recipe processors).

When selecting Spark as the compute engine, you can pick a Spark configuration (defined by a number of workers of a given size in CPUs and RAM) in the Advanced tab of visual recipes.

Dataiku screenshot of Spark configuration.
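If you prefer to inspect this setting programmatically rather than through the UI, the public Python API client can read a recipe's definition. Below is a minimal sketch, assuming the `dataikuapi` package, a placeholder instance URL, API key, and project/recipe names; the exact key path holding the Spark engine parameters in the returned JSON is an assumption, so inspect the raw definition on your own instance to confirm:

```python
# Minimal sketch using the dataikuapi package; all names below are placeholders.
import dataikuapi

client = dataikuapi.DSSClient("https://<your-space>.dataiku.io", "<API_KEY>")
project = client.get_project("MY_PROJECT")            # hypothetical project key
recipe = project.get_recipe("compute_my_dataset")     # hypothetical recipe name

settings = recipe.get_settings()
raw = settings.get_recipe_raw_definition()

# The selected Spark config lives somewhere in the recipe params; the key path
# below is an assumption -- print `raw` to locate it on your version.
print(raw.get("params", {}).get("engineParams", {}))
```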

Note

As with containers, we recommend starting with the smallest Spark config as a test. Note that every worker spins up a separate container. For example, the smallest Spark config, spark-XS-2-workers-of-1-CPU-3Gb-Ram, starts two containers with 1 CPU and 3 GB of RAM each, and so consumes a total of 2 CPUs and 6 GB of RAM from your quota (while counting as only 1 parallel activity).
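To budget your quota before moving to a larger config, the arithmetic generalizes directly: total consumption is the worker count times the per-worker resources. A small illustrative helper (not part of any Dataiku API) that reproduces the example above:

```python
# Hypothetical helper: total quota used by a Spark config is simply
# workers x per-worker CPU and workers x per-worker RAM.
def spark_quota(workers: int, cpu_per_worker: int, ram_gb_per_worker: int):
    return workers * cpu_per_worker, workers * ram_gb_per_worker

# spark-XS-2-workers-of-1-CPU-3Gb-Ram: 2 workers x (1 CPU, 3 GB RAM)
cpus, ram_gb = spark_quota(workers=2, cpu_per_worker=1, ram_gb_per_worker=3)
print(f"{cpus} CPUs, {ram_gb} GB RAM")  # -> 2 CPUs, 6 GB RAM
```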