Compute and Resource Quotas on Dataiku Cloud¶
Dataiku supports multiple execution engines to meet the needs of various jobs. Explore resources here for navigating computation and resource quotas on Dataiku Cloud.
Reference | Overview of compute engines on Dataiku Cloud¶
On Dataiku Cloud, you can access different kinds of compute resources to execute jobs:
Runs in-database (SQL)¶
If you have connected a supported SQL database (such as Snowflake, Redshift, Google BigQuery, Microsoft SQL Server, or PostgreSQL), you can push computation down to the database. On Dataiku Cloud, in-database execution should be your preferred compute engine for all eligible tasks.
As always, both the input and output datasets need to be in the SQL database. Also note that this compute engine is only available for SQL-type jobs: SQL code, most visual recipes, and Prepare recipes made up entirely of SQL-translatable processors.
Containerized execution on fully managed elastic AI compute¶
Dataiku Cloud includes fully managed Elastic AI compute capacities based on Kubernetes (k8s) to execute various workloads (containerized execution).
We recommend using containerized execution for all tasks where in-database compute is not possible, in particular Python notebooks and Spark jobs. More details on this compute engine are in the section on leveraging fully managed elastic AI compute below.
Local execution¶
Jobs can also be executed locally, using the same resources as the Dataiku application itself. Execution is local whenever “DSS - Local Stream” or “Use backend to execute” is selected.
By design, this type of execution can degrade the performance of the application. Using in-memory processing is not recommended when you can leverage a database.
Reference | Leveraging fully managed elastic AI compute¶
Elastic AI compute can be used to execute:
Python code recipes
Any visual recipe you want to run using Spark
Visual or code-based ML model training
To leverage elastic AI compute, choose a container configuration in the Advanced tab of a recipe or in the Runtime Environment panel of a visual analysis task.
You can choose the container’s capacity in terms of CPU and RAM, as well as the code environment to include. For advice on which container to choose, see the tip on choosing container sizes below.
A container chosen in this way is separate from the rest of your application and dedicated to the task, so it won’t interfere with other processes.
Reference | Managing elastic AI compute capacity¶
Dataiku Cloud manages the infrastructure of your instance and provides elastic AI compute capacity for containerized execution. This capacity depends on your subscription and is defined by quotas on three dimensions: CPU, GB of RAM, and parallel activities. Each quota is a limit on the maximum concurrent usage of that resource.
Your quota is reflected in the Running Tasks & Quota panel in your Launchpad. It is a common pool of resources shared by all users, so the capacity used by a task is not available to others until that task has finished. If a new task requests more resources than are left available on any one of the three quotas (CPU, RAM, or parallel activities), that task is queued until the resources it requests have been freed.
When a user starts a job requesting containerized execution, it launches one or several containers. This withdraws from your quota the CPU and RAM it uses, as well as one parallel activity. The quotas used for the containers will be freed when the job is finished. Webapps and notebooks need to be closed (unloaded) to free the resources they’re using (see section below for more details).
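The shared-pool behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not a Dataiku API: the `Quota` class, its method names, and the figure of 4 parallel activities are assumptions made for the example (the 10 CPU / 80 GB figures echo the container-size tip later on this page).

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Remaining capacity in the shared pool (hypothetical model, not a Dataiku API)."""
    cpu: float
    ram_gb: float
    parallel_activities: int

    def fits(self, cpu: float, ram_gb: float, activities: int = 1) -> bool:
        # A request is admitted only if it fits on all three dimensions.
        return (
            cpu <= self.cpu
            and ram_gb <= self.ram_gb
            and activities <= self.parallel_activities
        )

    def reserve(self, cpu: float, ram_gb: float, activities: int = 1) -> bool:
        """Withdraw resources if they fit; return False to signal 'queued'."""
        if not self.fits(cpu, ram_gb, activities):
            return False
        self.cpu -= cpu
        self.ram_gb -= ram_gb
        self.parallel_activities -= activities
        return True

    def release(self, cpu: float, ram_gb: float, activities: int = 1) -> None:
        """Return resources to the pool once the task has finished."""
        self.cpu += cpu
        self.ram_gb += ram_gb
        self.parallel_activities += activities


# Assumed quota: 10 CPUs, 80 GB of RAM, 4 parallel activities.
pool = Quota(cpu=10, ram_gb=80, parallel_activities=4)
first_started = pool.reserve(cpu=8, ram_gb=32)   # fits, so the task starts
second_started = pool.reserve(cpu=4, ram_gb=16)  # only 2 CPUs left, so it queues
pool.release(cpu=8, ram_gb=32)                   # the first task finishes
retry_started = pool.reserve(cpu=4, ram_gb=16)   # the queued task can now start
```

Note that the second request is held back by CPU alone, even though enough RAM and parallel activities remain: running out of any single dimension is enough to queue a task.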
Jobs on partitioned datasets launch several containers and count as several parallel activities. Partitions are processed simultaneously, each in a dedicated container, up to the number of parallel activities available in your quota. You can limit the maximum number of parallel activities requested by the recipe in the Advanced tab.
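As a rough illustration of the rule above, the number of partitions processed at once can be thought of as the smallest of three values: the number of partitions, the parallel activities still available, and the optional limit set on the recipe. The function and parameter names below are hypothetical and exist only for this sketch.

```python
from typing import Optional


def simultaneous_partitions(
    n_partitions: int,
    available_activities: int,
    recipe_limit: Optional[int] = None,
) -> int:
    """Estimate how many partitions run at once (illustrative only)."""
    candidates = [n_partitions, available_activities]
    if recipe_limit is not None:
        # Optional cap set in the recipe's Advanced tab.
        candidates.append(recipe_limit)
    return max(0, min(candidates))


no_limit = simultaneous_partitions(12, available_activities=4)
with_limit = simultaneous_partitions(12, available_activities=4, recipe_limit=2)
few_partitions = simultaneous_partitions(3, available_activities=4)
```

Setting a recipe limit below the available parallel activities leaves room in the quota for other users' jobs while the partitioned job runs.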
Reference | Resource quota management¶
The quotas you are entitled to based on your subscription are reflected in the Running Tasks & Quota panel of the Launchpad.
In this panel, the space admin can see the tasks, notebooks, and jobs that are currently running on your instance, along with the usage of your quotas. The charts are accurate as of the moment you click the Refresh button.
To free resources, a space admin can stop any task and unload notebooks directly from this panel by clicking on the cross next to each one.
Tip | Choosing container sizes¶
The largest container available is determined by the quota included in your subscription. For example, if your quota is 10 CPUs and 80 GB of RAM, the largest container available in the drop-down menu will be “CPU-10-RAM80Gb”.
Using a large container by default is not recommended as it can exhaust available resources very quickly and prevent others from executing their jobs. We recommend starting with the smallest container available and increasing its size if need be.
There are generally two cases in which to increase the size of the container:
The execution failed with an “out of memory” error because the container is too small. In that case, increase the container size to allow more memory.
The execution takes too long and can be parallelized, such as with hyperparameter search in visual ML.
When working with a very large dataset, you can also first test the job on a sample of the data with the smallest container.
Tip | Using Spark¶
Dataiku Cloud allows you to leverage Spark on Kubernetes (k8s) for distributed execution of heavy data wrangling jobs that are not SQL-compatible (e.g. some Prepare recipe processors).
When choosing Spark as a compute engine, you can choose the Spark config (given by a number of workers of a certain size in CPUs and RAM) in the Advanced tab of visual recipes.
As with containers, we recommend starting with the smallest Spark config as a test. Note that every worker spins up a separate container. For example, the smallest Spark config, spark-XS-2-workers-of-1-CPU-3Gb-Ram, starts two containers of 1 CPU and 3 GB of RAM each, and so consumes a total of 2 CPUs and 6 GB of RAM of your quota (but only 1 parallel activity).
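The arithmetic in this example can be sketched as follows. The helper below is hypothetical, and the naming pattern it parses is inferred from the single config name quoted above, so other configs may not follow it.

```python
import re

# Pattern inferred from "spark-XS-2-workers-of-1-CPU-3Gb-Ram" (an assumption).
_SPARK_CONFIG = re.compile(r"(\d+)-workers-of-(\d+)-CPU-(\d+)Gb", re.IGNORECASE)


def spark_quota_usage(config_name: str) -> dict:
    """Estimate the quota consumed by a Spark config (illustrative only)."""
    match = _SPARK_CONFIG.search(config_name)
    if match is None:
        raise ValueError(f"Unrecognized Spark config name: {config_name!r}")
    workers, cpu_per_worker, ram_per_worker = map(int, match.groups())
    return {
        # Each worker runs in its own container, so CPU and RAM add up.
        "cpu": workers * cpu_per_worker,
        "ram_gb": workers * ram_per_worker,
        # The whole Spark job counts as a single parallel activity.
        "parallel_activities": 1,
    }


usage = spark_quota_usage("spark-XS-2-workers-of-1-CPU-3Gb-Ram")
```

For the smallest config this gives 2 CPUs, 6 GB of RAM, and 1 parallel activity, matching the figures in the text.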
Troubleshoot | My job takes an unusually long time to complete¶
Your job might be queuing before it starts because other jobs, launched by you or by other users on your space, are consuming the resources in your subscription’s quota. You can check your resource quotas in the Running Tasks & Quota panel to investigate.
Also, note that there might be a latency of up to 2 minutes before the job starts, as additional resources may need to be provisioned. This may happen more often with larger configurations.
Troubleshoot | My job queues for a long time and then fails without ever starting¶
ML training jobs can be queued for a maximum of 30 minutes. If resources are not available before those 30 minutes have passed, the training is aborted automatically, and you will have to restart it manually.
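The timeout rule can be pictured with a minimal sketch. The function name is hypothetical, and the handling of the exact 30-minute boundary is an assumption, since the text only says resources must become available before the 30 minutes have passed.

```python
MAX_QUEUE_MINUTES = 30  # maximum queue time for ML training jobs, from the text above


def training_outcome(minutes_until_resources_free: float) -> str:
    """Return 'started' or 'aborted' for a queued ML training (illustrative only)."""
    # Assumption: resources freed strictly before the limit let the job start.
    if minutes_until_resources_free < MAX_QUEUE_MINUTES:
        return "started"
    return "aborted"


short_wait = training_outcome(12)
long_wait = training_outcome(45)
```

If a training is aborted this way, freeing resources first (for example by unloading idle notebooks) before restarting it manually reduces the chance of hitting the limit again.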