Reference | Managing elastic AI compute capacity

Dataiku Cloud manages the infrastructure of your instance and provides elastic AI compute capabilities for containerized execution. These capabilities depend on your subscription. Three dimensions define these quotas: CPU, GB of RAM, and parallel activities. These dimensions act as limits, capping the maximum concurrent usage of those resources.

The Usage & Monitoring panel in your Launchpad reflects your quota. The quota is a common pool of resources shared by all users, so the capacity used by a task isn't available to others until that task finishes. If a new task requests more resources than are left available (on any of the three quotas: CPU, RAM, or parallel activities), that task is queued until the resources it requests have been freed.

When a user starts a job requesting containerized execution, the job launches one or several containers. This withdraws from your quota the CPU and RAM the containers use, as well as one parallel activity. The quota used by the containers is freed when the job finishes. Webapps and notebooks, however, must be closed (unloaded) before the resources they're using are freed.
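For example, a space administrator can free the resources held by loaded Jupyter notebooks through the public API. The sketch below is a minimal illustration, not an official procedure: it assumes a dataikuapi version exposing list_jupyter_notebooks() and unload(), and uses hypothetical host, API key, and project key values.

```python
import dataikuapi

# Hypothetical instance URL and API key: replace with your own.
client = dataikuapi.DSSClient("https://your-instance.app.dataiku.io", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")  # hypothetical project key

# List only the notebooks that currently have a loaded session.
for notebook in project.list_jupyter_notebooks(active=True, as_type="object"):
    # Unloading stops the kernel, which releases the CPU, RAM, and
    # parallel activity the notebook holds against the quota.
    notebook.unload()
```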

Note

Jobs on partitioned datasets launch several containers and count as several parallel activities. Dataiku processes each partition in a dedicated container and runs as many partitions simultaneously as your quota has parallel activities available. For example, a recipe building 30 partitions with 10 free parallel activities runs 10 containers at a time. You can limit the maximum number of parallel activities requested by the recipe in its Advanced tab.

Elastic AI Compute metrics

To better understand how the Elastic AI Compute quota is used, an extension is available. To install it:

  1. In the Cloud Launchpad, navigate to the Extensions panel.

  2. Click + Add an Extension.

  3. In the Advanced features category, select Elastic AI Compute Metrics.

  4. Click Add.

This extension creates, within your Design node, an S3 connection named elastic-ai-compute-metrics, usable by the space_administrators group.

To import the metrics dataset in a new project:

  1. Click on Connect or create.

  2. In Add new dataset, click on Amazon S3.

  3. In S3 connection, choose elastic-ai-compute-metrics.

  4. Click on Test to retrieve the format.

  5. Optionally, if you want to use daily partitioning:

    • Click on the Partitioning tab, then Activate Partitioning.

    • Fill in Pattern with year=%Y/month=%_M/day=%_D/.*

    • Click on Add Time Dimension, choose date and DAY as Period.

    • Click on List Partitions to validate the partitioning scheme.

  6. In New dataset name, fill in the name of your choice, and then click Create.
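If you'd rather script this setup than click through the UI, the sketch below shows roughly how an equivalent dataset could be created with dataikuapi. It's a hedged illustration: the host, API key, project key, and dataset name are placeholders, and the format settings are an assumption (in the UI, Test detects the real format for you).

```python
import dataikuapi

# Hypothetical instance URL and API key: replace with your own.
client = dataikuapi.DSSClient("https://your-instance.app.dataiku.io", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")  # hypothetical project key

# Point a new dataset at the S3 connection installed by the extension.
dataset = project.create_dataset(
    "elastic_ai_compute_metrics",  # dataset name of your choice
    type="S3",
    params={"connection": "elastic-ai-compute-metrics", "path": "/"},
    formatType="csv",  # assumption: use the format that Test detects
    formatParams={"separator": "\t", "parseHeaderRow": True},
)
```

Partitioning can then be activated in the dataset's settings, as in step 5 above.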

The resulting dataset contains one line per minute for every running containerized execution (pod), with the following columns:

| Column | Description |
| --- | --- |
| timestamp | The timestamp of the measurement (one line per minute per running pod). |
| dss_version | The three-part DSS version (for example, 14.1.4). |
| namespace | The K8s namespace where the execution is being done. Values can be space-xxxxxxxx-dku-compute for Design and pre-production nodes, space-xxxxxxxx-dku-compute-automation for Automation nodes, or space-xxxxxxxx-dku-compute-gpu for GPU. |
| node_id | The node for which the pod is running. Values can be design, automation, or pre-prod-automation. |
| project_key | The key (in lowercase) of the project that launched the containerized execution (pod). |
| pod | The name of the pod as submitted from the Dataiku node to the Elastic AI Compute cluster. |
| requested_cpu | The CPU value of the containerized execution configuration. Empty for GPU-type containerized execution configurations. |
| requested_ram | The RAM value of the containerized execution configuration, expressed in bytes. Empty for GPU-type containerized execution configurations. |
| requested_gpu | The number of GPUs in the containerized execution configuration. Only filled for GPU-type containerized execution configurations. |
| used_cpu | The real CPU usage of the container (averaged over the last 5 minutes). Can be empty while the pod is starting. |
| used_ram | The real RAM usage at the timestamp (not an average). Can be empty while the pod is starting. |
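As an example of how to use this dataset, the hedged sketch below computes the peak concurrent resource usage per minute, which you can compare against your quota. It's meant to run in a Python recipe or notebook on your instance; the dataset name elastic_ai_compute_metrics is an assumption matching the import steps above.

```python
import dataiku

# Read the metrics dataset created above (hypothetical name).
df = dataiku.Dataset("elastic_ai_compute_metrics").get_dataframe()

# Each row is one pod for one minute, so summing requested resources per
# timestamp approximates the concurrent usage at that minute. Counting
# distinct pods only approximates parallel activities, since a job with
# several containers withdraws a single parallel activity.
concurrent = df.groupby("timestamp").agg(
    cpu=("requested_cpu", "sum"),
    ram_bytes=("requested_ram", "sum"),
    pods=("pod", "nunique"),
)

# Peak concurrent usage over the period covered by the dataset.
print(concurrent.max())
```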

See also

Additional columns execution_type, job_id, activity_id, execution_id, install_id, code_studio_id, webapp_id, analysis_id, mltask_id, mltask_session_id, and submitter are defined in the pod labeling page of the reference documentation.
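These label columns make usage attribution straightforward. For instance, the hedged sketch below sums requested CPU by project and execution type; the actual execution_type values depend on your workloads.

```python
import dataiku

df = dataiku.Dataset("elastic_ai_compute_metrics").get_dataframe()

# Attribute requested CPU to projects and execution types (jobs,
# notebooks, webapps, and so on, depending on your workloads).
by_origin = (
    df.groupby(["project_key", "execution_type"])["requested_cpu"]
    .sum()
    .sort_values(ascending=False)
)
print(by_origin.head(10))
```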

In addition, activating the extension automatically installs in your space a project named Cloud Resources Usage Monitoring. It provides a generic dashboard to understand your Elastic AI Compute usage globally and per project.

Note

This dataset is retained for one year. If you need longer retention, we advise syncing the data to a connection where you can manage retention yourself.
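One hedged way to do this is a small Python recipe that copies the metrics into a dataset on a connection you control; both dataset names below are assumptions for illustration.

```python
import dataiku

# Source: the metrics dataset (hypothetical name, see the import steps above).
source = dataiku.Dataset("elastic_ai_compute_metrics")
# Target: a dataset created on a long-lived connection you own, where
# retention is up to you (hypothetical name).
target = dataiku.Dataset("elastic_ai_compute_metrics_archive")

df = source.get_dataframe()

# Overwrites on each run by default; switch the output to append mode in
# the recipe's Inputs/Outputs settings to accumulate history over time.
target.write_with_schema(df)
```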