Compute and Resource Quotas on Dataiku Online

Overview of compute engines on Dataiku Online

On Dataiku Online, you can access different compute resources to execute jobs:

1. In-database execution (SQL)

You can connect your existing databases to Dataiku Online (see this article for available connectors).

If you have connected a SQL database (such as Snowflake, Redshift, Google BigQuery, Microsoft SQL Server, PostgreSQL), you can push down compute in-database.

Note that both the input and output datasets need to be stored in the SQL database. Also, this compute engine is only available for SQL-type jobs: SQL code recipes, most visual recipes, and Prepare recipes made only of SQL-translatable processors (see the full list of processors here).

When using Dataiku Online, in-database should be your preferred compute engine for all eligible tasks.
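
For code-based work, a similar push-down is possible: instead of pulling rows into memory, you can send a query to the connected database and retrieve only the result. Below is a minimal sketch using Dataiku's SQLExecutor2 from a Python notebook or recipe; the dataset, table, and column names are purely illustrative, and the table name in the query should match the actual table backing the dataset.

    # Minimal sketch of pushing an aggregation down to the database.
    # "orders", "order_date" and "amount" are illustrative names.
    import dataiku
    from dataiku import SQLExecutor2

    orders = dataiku.Dataset("orders")

    # The query below runs in the database engine itself; only the aggregated
    # result comes back to Dataiku as a pandas DataFrame.
    executor = SQLExecutor2(dataset=orders)
    daily_totals = executor.query_to_df("""
        SELECT order_date, SUM(amount) AS total_amount
        FROM orders            -- use the actual table name backing the dataset
        GROUP BY order_date
    """)
    print(daily_totals.head())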

2. Containerized execution on fully managed Elastic AI compute

Dataiku Online includes fully managed Elastic AI compute capacities based on Kubernetes (k8s) to execute various workloads (containerized execution).

We recommend using containerized execution for all tasks where in-database compute is not possible, in particular Python notebooks and Spark jobs. More details on this compute engine are outlined below.

3. Local execution

Jobs can also be executed locally, using the same resources as the DSS application itself. By design, this type of execution competes with the application for resources and can degrade its performance. In-memory processing is not recommended when you can leverage a database. Execution happens locally whenever “DSS - Local Stream” or “Use backend to execute” is selected.

How to best leverage fully managed Elastic AI compute on Dataiku Online

Elastic AI compute can be used to execute:

  • Python code recipes

  • Any visual recipe you want to run using Spark

  • Visual or code-based ML model training

  • Notebooks

  • Webapps

  • Code Studios

To leverage Elastic AI compute, choose a container configuration in the Advanced tab of a recipe or the Runtime Environment tab of a visual analysis task.

You can choose the container’s capacity in terms of CPU and RAM, as well as the code environment to include. (For advice on choosing a container, see the section How to choose my container size below.)

Choosing a container in this way runs the task in a container that is separate from the rest of your application and dedicated to that task, ensuring that it won’t interfere with other processes.
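
Note that the recipe code itself does not change when it runs in a container: the container selection in the Advanced tab determines where it runs. Below is a minimal Python recipe sketch; dataset and column names are illustrative.

    # Minimal Python recipe sketch. Whether this runs locally or in a managed
    # Elastic AI container is decided by the container selection in the
    # recipe's Advanced tab, not by the code itself.
    import dataiku

    input_ds = dataiku.Dataset("transactions")
    output_ds = dataiku.Dataset("transactions_by_customer")

    df = input_ds.get_dataframe()

    # In-memory pandas processing: this is what benefits from the container's
    # CPU and RAM allocation.
    summary = (
        df.groupby("customer_id", as_index=False)
          .agg(total_amount=("amount", "sum"), n_orders=("amount", "count"))
    )

    output_ds.write_with_schema(summary)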

Dataiku screenshot of the container configuration selection behavior dropdown.

What Elastic AI compute capacity can you access?

Dataiku Online manages the infrastructure of your instance and provides Elastic AI computing capabilities that can be used for containerized execution. These capabilities depend on your subscription and are defined by quotas on three dimensions: CPU, GB of RAM, and parallel activities. These dimensions act as limits and define the maximum concurrent usage of those resources.

Your quota is reflected in the “Running Tasks & Quota” tab of your launchpad. It is a common pool of resources shared by all users, which means that the capacities used by a task won’t be available to others until that task is finished. If a new task requests more resources than are left available (on any of the three quotas: CPU, RAM, or parallel activities), that task is queued until the resources it requests have been freed.

When a user starts a job requesting containerized execution, it launches one or several containers. This withdraws from your quota the CPU and RAM the job uses, as well as one parallel activity. The quota used by the containers is freed when the job finishes. Webapps and notebooks need to be closed (unloaded) to free the resources they’re using (see the section below for more details).

A quick note on partitioned datasets: jobs on partitioned datasets launch several containers and count as several parallel activities. Each partition is processed in a dedicated container, and as many partitions run simultaneously as your remaining parallel activities allow. You can limit the maximum number of parallel activities requested by the recipe in its “Advanced” tab.
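
To make this accounting concrete, the sketch below models a hypothetical quota of 12 CPUs, 48 GB of RAM, and 4 parallel activities, and shows why a partitioned job may only run some of its partitions at once. All numbers are illustrative, and this is not a Dataiku API.

    # Illustrative model of quota accounting: each running container consumes
    # CPU, RAM, and one parallel activity until it finishes.
    QUOTA = {"cpu": 12, "ram_gb": 48, "activities": 4}

    # A partitioned job using a CPU-2-RAM8Gb container on 6 partitions asks
    # for one container per partition.
    container = {"cpu": 2, "ram_gb": 8}
    partitions = 6

    def max_concurrent(quota, container):
        """How many containers of this size can run at the same time."""
        return min(
            quota["cpu"] // container["cpu"],
            quota["ram_gb"] // container["ram_gb"],
            quota["activities"],
        )

    running = min(partitions, max_concurrent(QUOTA, container))
    queued = partitions - running
    print(f"{running} partitions run now, {queued} wait for resources to be freed")
    # -> 4 partitions run now, 2 wait for resources to be freed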

How to find and manage my quotas

The quotas you are entitled to thanks to your subscription are reflected on the Launchpad in the “Running Tasks & Quota” tab.

In this tab, the space admin can see the tasks, notebooks, and jobs currently running on your instance, as well as your quota usage. The charts reflect usage as of the last time you pressed the Refresh button. To free resources, a space admin can stop any task and unload notebooks directly from this tab by clicking the cross next to each one.

Dataiku screenshot of the running tasks and quota tab of an Online instance space.

How to choose my container size

Note that the biggest container available is determined by the quota included in your subscription (i.e. if your quota is 10 CPUs and 80 GB of RAM, the biggest container available in the drop-down menu will be “CPU-10-RAM80Gb”).

Using a large container by default is not recommended, as it can exhaust available resources very quickly and prevent others from executing their jobs. We recommend starting with the smallest container available and increasing its size if need be. There are generally two cases in which to increase the size of the container:

  • When the execution fails with an “out of memory” error because the container is too small. In that case, increase the container size to allow more memory.

  • When execution takes too long and the work can be parallelized, such as with hyperparameter search in visual ML.

When working with a very large dataset, you can also test the job first by running it on a sample of the data with the smallest container, as in the sketch below.
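
One simple way to do this from Python is to read only a limited number of rows during the test run. The dataset name and row limit below are illustrative.

    # Smoke-test sketch: run the recipe logic on a small sample in the
    # smallest container before scaling up.
    import dataiku

    big_ds = dataiku.Dataset("very_large_dataset")

    # Read only the first 100,000 rows instead of the whole dataset.
    sample_df = big_ds.get_dataframe(sampling="head", limit=100000)

    # ... run your processing on sample_df; once it works, rerun on the full
    # dataset, and pick a larger container only if you hit memory limits.
    print(f"Sample loaded: {len(sample_df)} rows, {len(sample_df.columns)} columns")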

A note about using Spark on Online

Dataiku Online allows you to leverage Spark on k8s for the distributed execution of heavy data wrangling jobs that are not SQL-compatible (e.g. some Prepare recipe processors). When choosing Spark as a compute engine, you can choose the Spark config (defined by a number of workers of a given CPU and RAM size) in the “Advanced” tab of visual recipes.

Dataiku screenshot of Spark configuration.

As with the containers, we recommend starting with the smallest Spark config as a test. Note that every worker spins up a separate container. For example, the smallest Spark config, spark-XS-2-workers-of-1-CPU-3Gb-Ram, starts two containers of 1 CPU and 3 GB of RAM each, and so consumes a total of 2 CPUs and 6 GB of RAM of your quota (but only 1 parallel activity).
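
The same Spark configs also apply to code-based Spark jobs. Below is a minimal PySpark recipe sketch using Dataiku's Spark integration; dataset and column names are illustrative, and the number of workers (containers) is determined by the Spark config selected in the Advanced tab, not by the code.

    # Minimal PySpark recipe sketch for Spark-on-k8s execution.
    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Read the input dataset as a Spark DataFrame; the work is distributed
    # across the workers defined by the selected Spark config.
    events = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("raw_events"))

    # Example transformation: keep valid events and count them per day.
    daily_counts = (
        events.filter(events["status"] == "OK")
              .groupBy("event_date")
              .count()
    )

    # Write the result back to a Dataiku dataset.
    dkuspark.write_with_schema(dataiku.Dataset("daily_event_counts"), daily_counts)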

Troubleshooting - common issues

My job takes an unusually long time to complete

Your job might be queuing because other jobs (launched by you or by other users on your space) are consuming all of the resources allowed by your subscription’s quota, so your job has to wait before starting. See how to check your quotas here: How to find and manage my quotas.

Also note that there can be a latency of up to 2 minutes before the job starts, as it may need to bring up additional resources. This tends to happen more often with larger configurations.

My job queues for a long time and then fails without ever starting

ML training jobs can be queued for a maximum of 30 minutes. If resources are not available before those 30 minutes have passed, the training is aborted automatically, and you will have to restart it manually.