Tip | Using Spark

Dataiku Cloud allows you to leverage Spark on Kubernetes (k8s) for the distributed execution of heavy data wrangling jobs that are not SQL-compatible (for example, some Prepare recipe processors).

When Spark is selected as the compute engine, you can pick the Spark config in the Advanced tab of visual recipes. Each Spark config defines a number of workers, each with a fixed allocation of CPUs and RAM.

Dataiku screenshot of Spark configuration.

Note

As with containers, try starting with the smallest Spark config as a test. Note that every worker spins up a separate container. For example, the smallest Spark config (spark-XS-2-workers-of-1-CPU-3Gb-Ram) starts two containers of 1 CPU and 3 GB of RAM each. Therefore, it will consume a total of 2 CPUs and 6 GB of RAM from your quota, while counting as only one parallel activity.
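Because every worker is a separate container, the quota footprint of a config is simply workers × CPUs and workers × RAM. The following is a minimal sketch in plain Python (not a Dataiku API), assuming config names follow the pattern of the example above; the naming convention is an assumption based on that single example.

```python
import re


def quota_footprint(config_name: str) -> tuple[int, int]:
    """Return (total_cpus, total_ram_gb) consumed by a Spark config.

    Assumes names like "spark-XS-2-workers-of-1-CPU-3Gb-Ram"
    (an assumption based on the example in the note above).
    """
    m = re.match(
        r"spark-\w+-(\d+)-workers?-of-(\d+)-CPU-(\d+)Gb-Ram",
        config_name,
        flags=re.IGNORECASE,
    )
    if m is None:
        raise ValueError(f"Unrecognized config name: {config_name!r}")
    workers, cpus_per_worker, ram_gb_per_worker = (int(g) for g in m.groups())
    # Every worker spins up its own container, so totals scale linearly.
    return workers * cpus_per_worker, workers * ram_gb_per_worker


total_cpus, total_ram = quota_footprint("spark-XS-2-workers-of-1-CPU-3Gb-Ram")
print(f"{total_cpus} CPUs, {total_ram} GB RAM")  # -> 2 CPUs, 6 GB RAM
```

Running this on the smallest config reproduces the arithmetic from the note: two workers of 1 CPU and 3 GB each consume 2 CPUs and 6 GB of RAM of your quota.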