Solution | Leveraging Compute Resource Usage Data#

Overview#

This article guides you through a solution that helps you keep track of the resource consumption of your Dataiku platform.

Introduction to Resource Consumption#

Initially, you may use Dataiku as a standalone application that handles data storage, computation, and model training. However, as your objectives grow, you will want to integrate the platform into a wider ecosystem.

At first, you will typically leverage external data sources (such as remote file storage, relational databases, data warehouses, and so on).

Soon, you will have more and more computation to run, demanding more power. Because the local DSS engine is inherently limited, you will need to leverage external computation technologies. The foremost one that Dataiku uses is Spark on Kubernetes, also known as Elastic AI.

Additionally, in order to properly segregate teams and have dedicated UAT or Production servers, you will deploy several Dataiku instances.

This strategy characterizes the path to success with Dataiku. However, it comes at a cost. Monitoring the resources used by your Dataiku instances is a critical topic, especially when the number of projects and users is growing.

Fortunately, Dataiku has an embedded capability to keep track of all those resource requests, called Compute Resource Usage (or CRU). To get familiar with it, read the reference documentation on Compute resource usage reporting. These logs enable monitoring of resources down to the project and job level.

Installation#

The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

This solution is not available on Dataiku Cloud. Although you may try to import the zip file found in the self-managed instructions onto a Cloud instance, Dataiku offers no support in this case.

Technical Requirements#

To leverage this solution, you must meet the following requirements:

  • Have access to a Dataiku 13.2+ instance.

Workflow Overview#

The goal of this project is to help you leverage the compute resource usage logs to understand which user or project is consuming the most resources. Once you’ve completed this walkthrough, you should have enough knowledge to venture into this solution yourself and replicate it without making any changes.

The project answers multiple needs with a focus on three main ones:

  • Reliability of Services / Capacity Planning: Understand which project, user, or activity consumes resources during peak RAM/CPU usage, in order to deploy the platform on a machine with the right CPU/RAM or with the right sizing of Kubernetes clusters.

  • Cost Reduction (FinOps) / Green AI: Analyze the duration of computations and the RAM consumption over time to monitor costs, whether financial or environmental.

  • Investigate usage patterns: Perform usage pattern and peak usage investigation.

As a plug-and-play solution, it only requires you to connect your compute_resource_usage_logs, which come either from an Event server or directly from the audit logs of your instance. After some data preparation, three datasets defining the main categories are worked on separately in their respective Flow zones: Local Resources, SQL Resources, and Kubernetes Resources. They are afterwards stacked into one dataset to analyze the cumulative usage over the total duration of the processes. One last Flow zone, Instant Usage Analysis, analyzes the minute-by-minute logs.

View of the full Flow that this article details.

As a result of this data preparation, dashboards are generated for the user to better analyze which project/user is consuming resources, and on which infrastructure. Moreover, a Project Setup is also included in this solution in order to help you connect your logs and build the dashboard.

CRU Explained#

If you have read the reference documentation, you know that CRU logs are natively gathered through the audit centralization mechanism, and stored in the audit logs that you can find in datadir/run/audit/.

Those logs track the CPU and memory usage of local and Kubernetes processes. For SQL computations, they provide the execution time (which might include waiting time). The smallest unit of time you will see in this project is the minute: measures are aggregated per minute.

For each process, a trace is produced:

  • compute-resource-usage-start: emitted when a process starts.

  • compute-resource-usage-update: emitted every minute over the duration of the process.

  • compute-resource-usage-complete: emitted when a process ends.
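
If you want to peek at these traces before building the Flow, a minimal Python sketch such as the one below can scan an audit directory and count the CRU message types. The directory path and the msgType field name are assumptions to adapt to your own setup.

```python
import glob
import gzip
import json
from collections import Counter

# Assumed location; adjust to your DSS data directory or Event server output
AUDIT_DIR = "/path/to/datadir/run/audit"

counts = Counter()
for path in glob.glob(AUDIT_DIR + "/*.log*"):
    open_fn = gzip.open if path.endswith(".gz") else open
    with open_fn(path, "rt") as f:
        for line in f:
            try:
                event = json.loads(line)
            except ValueError:
                continue  # skip malformed lines
            # Field name assumed here; check the structure of your own audit events
            msg_type = str(event.get("msgType", ""))
            if msg_type.startswith("compute-resource-usage"):
                counts[msg_type] += 1

print(counts.most_common())
```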

Although CRU logs are sent to the audit centralization mechanism, we strongly advise you to install an Event server and to configure your auditing target’s destination to point to it (Administration > Settings > Auditing).

It comes with several benefits:

  • You keep the history of the logs (the default audit directory is regularly flushed).

  • You can monitor all your nodes with the same Event server and group the analysis into the same dashboard.

  • The Event server can dispatch the logs into different folders according to their topic and date. This helps load less data into the Flow and makes it easy to partition your datasets.

Finally, to monitor Kubernetes processes, you’ll need to enable periodic reporting in your Settings. The platform will then query each Kubernetes cluster every minute for the list of pods and their current resource consumption, and emit a dedicated CRU message containing that information.

Walkthrough#

Now that this is clear, let’s dive deeper into the Flow.

Input Data#

In the Input Data zone, the first Python recipe makes sure that, whatever the origin of the CRU logs, we get the same schema and only keep CRU-related rows. It is not a mandatory or complex step, but a practical one.

Dataiku screenshot focused on the input zone of the Flow.
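
As an illustration of what such a recipe can look like, here is a minimal sketch using the Dataiku Python API. The dataset and column names are placeholders, not the solution’s actual ones.

```python
import dataiku

# Placeholder dataset names; the solution's real inputs and outputs differ
raw = dataiku.Dataset("cru_logs_raw")
df = raw.get_dataframe()

# Keep only CRU-related rows, whatever the origin of the logs.
# The column holding the message type is assumed to be named 'msgType'.
cru_prefix = "compute-resource-usage"
df = df[df["msgType"].astype(str).str.startswith(cru_prefix)]

output = dataiku.Dataset("cru_logs_filtered")
output.write_with_schema(df)
```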

The next recipe normalizes the data: it cleans the logs and pre-computes the date at the day, hour, and minute levels. This is done early in the Flow so that we can use SQL engines later on.
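
The date pre-computation can be sketched in pandas as follows; the timestamp column name and the derived column names are assumptions.

```python
import pandas as pd

# Toy input; in the Flow this would be the cleaned CRU dataframe
df = pd.DataFrame({"timestamp": ["2024-05-01T10:17:42Z", "2024-05-01T11:03:05Z"]})

ts = pd.to_datetime(df["timestamp"])
df["date_day"] = ts.dt.floor("D")       # day-level granularity
df["date_hour"] = ts.dt.floor("h")      # hour-level granularity
df["date_minute"] = ts.dt.floor("min")  # minute-level granularity
print(df)
```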

The Split recipe then dispatches the remaining rows into three main categories:

  • One for all SQL processing (logs_sql_connection)

  • One for local processing (logs_local_process)

  • One for Kubernetes processing (logs_k8s)

In the end, there is nothing very complex here — mostly data splitting.
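
A rough pandas equivalent of that dispatch could look like the sketch below. The column name and category values are assumptions, since the actual solution uses a visual Split recipe with its own conditions.

```python
import pandas as pd

# Toy input; 'compute_type' and its values are illustrative, not real CRU field names
df = pd.DataFrame({
    "compute_type": ["SQL_CONNECTION", "LOCAL_PROCESS", "KUBERNETES_POD"],
    "project_key": ["PROJ_A", "PROJ_B", "PROJ_A"],
})

is_sql = df["compute_type"].isin(["SQL_CONNECTION", "SQL_QUERY"])
is_k8s = df["compute_type"].eq("KUBERNETES_POD")

logs_sql_connection = df[is_sql]
logs_k8s = df[is_k8s]
logs_local_process = df[~is_sql & ~is_k8s]
```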

One thing to note is that some rows from the original CRU data are dropped and never used. This is normal: they correspond to the start and completion reports of Kubernetes tasks, which are of no use here.

Kubernetes, SQL, and Local Resources#

Each of those categories will have a specific format of logs and, therefore, specific data preparation in their dedicated Flow zones. Let’s look at each of them.

Dataiku screenshot focused on the Kubernetes, SQL, and local resources zones of the Flow.

Regarding the Kubernetes Resources, most of the work is done in the Prepare recipe, which is organized into two groups of operations: Instant Usage of Resources and Cumulative Usage of Resources.

The Group recipe aggregates the resources used by each pod, by minute and by process, whereas the Window recipe is used to take the latest log trace related to each process.

Dataiku screenshot showing the Kubernetes resources Flow zone.
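
To make those two steps concrete, here is a hedged pandas sketch of the same logic: an aggregation per process and per minute, then a window-style selection of the latest trace per process. All column names are assumptions.

```python
import pandas as pd

# Toy CRU update traces for two pods (column names are assumptions)
k8s = pd.DataFrame({
    "process_id": ["p1", "p1", "p2"],
    "date_minute": pd.to_datetime(["2024-05-01 10:00", "2024-05-01 10:01", "2024-05-01 10:00"]),
    "cpu_millis": [250, 300, 120],
    "memory_mb": [512, 530, 256],
})

# "Group recipe" step: aggregate pod resources by process and by minute
per_minute = (
    k8s.groupby(["process_id", "date_minute"], as_index=False)
       .agg({"cpu_millis": "max", "memory_mb": "max"})
)

# "Window recipe" step: keep only the latest trace of each process
latest = k8s.sort_values("date_minute").groupby("process_id").tail(1)

print(per_minute, latest, sep="\n\n")
```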

If you want to go further on Kubernetes metrics and their usage, we recommend the Datadog blog entry Collecting Metrics With Built-in Kubernetes Monitoring Tools.

The goal of the SQL Resources zone is to measure the execution time of all SQL processes. A Sample/Filter recipe keeps only the completed processes, and the execution time is then computed from them.

Dataiku screenshot showing the SQL resources Flow zone.
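
A possible pandas equivalent of that computation is sketched below, assuming each completed trace carries a start and an end timestamp (the column names are hypothetical).

```python
import pandas as pd

# Toy completed SQL traces; 'start_time' and 'end_time' are assumed column names
sql = pd.DataFrame({
    "process_id": ["q1", "q2"],
    "connection": ["warehouse_a", "warehouse_b"],
    "start_time": pd.to_datetime(["2024-05-01 10:00:00", "2024-05-01 10:02:30"]),
    "end_time": pd.to_datetime(["2024-05-01 10:01:40", "2024-05-01 10:02:45"]),
})

# Execution time in seconds (may include waiting time, as noted earlier)
sql["execution_seconds"] = (sql["end_time"] - sql["start_time"]).dt.total_seconds()
print(sql)
```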

The Local Resources zone is very similar to the Kubernetes zone, as it also mostly relies on some cleanup and computation done in the Prepare recipe.

The Group recipe ensures we keep only one row per process and per minute, with the maximum resource usage, for the minute-by-minute calculation.

The Window recipe takes the latest log trace related to each process. This is more reliable than filtering on “complete” traces because it also captures aborted jobs, which do not have a “complete” trace.

Lastly, we compute the interval between two consecutive runs of the same job to monitor process frequencies, on the branch leading to logs_local_process_frequencies_stats_prepared.

Dataiku screenshot showing the local resources Flow zone.
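
The frequency computation, in particular, boils down to a per-job difference between consecutive start times. Here is a minimal pandas sketch of that idea, with hypothetical column names.

```python
import pandas as pd

# Toy job runs; 'job_name' and 'start_time' are assumed column names
runs = pd.DataFrame({
    "job_name": ["build_sales", "build_sales", "build_sales", "score_model"],
    "start_time": pd.to_datetime([
        "2024-05-01 10:00", "2024-05-01 11:00", "2024-05-01 13:30", "2024-05-01 10:15",
    ]),
})

# Interval between two consecutive runs of the same job, in minutes
runs = runs.sort_values(["job_name", "start_time"])
runs["minutes_since_previous_run"] = (
    runs.groupby("job_name")["start_time"].diff().dt.total_seconds() / 60
)
print(runs)
```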

Instant and Cumulative Usage Analysis#

The Instant Usage Analysis and Cumulative Usage Analysis Flow zones feed two tabs of the dashboard.

Dataiku screenshot focused on the instant and cumulative usage analysis zones of the Flow.

The Instant Usage Analysis zone gathers the analysis of the minute-by-minute logs. This Flow zone produces four datasets: Logs by day & hours, Logs by minute, Logs Top Memory usage, and Logs Top CPU usage. They power the visualizations in the Instant Usage Analysis tab of the dashboard, which helps with capacity planning.

As for the Cumulative Usage Analysis zone, we stack the local, Kubernetes, and SQL processes into one dataset to analyze the cumulative resource usage over the total duration of the processes. This analysis helps with resource optimization and with reducing financial costs and/or CO2 impact.
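
A minimal sketch of that stacking and of a cumulative view in pandas is shown below; the frames and column names are hypothetical, and for SQL processes the usage measure would be execution time rather than CPU/RAM.

```python
import pandas as pd

# Hypothetical prepared frames sharing a common schema (names are placeholders)
local = pd.DataFrame({"project_key": ["A"], "usage_amount": [120.0],
                      "date_minute": pd.to_datetime(["2024-05-01 10:00"])})
k8s = pd.DataFrame({"project_key": ["A"], "usage_amount": [300.0],
                    "date_minute": pd.to_datetime(["2024-05-01 10:01"])})
sql = pd.DataFrame({"project_key": ["B"], "usage_amount": [45.0],
                    "date_minute": pd.to_datetime(["2024-05-01 10:02"])})

stacked = pd.concat([local, k8s, sql], ignore_index=True)

# Cumulative usage over time, per project
stacked = stacked.sort_values("date_minute")
stacked["cumulative_usage"] = stacked.groupby("project_key")["usage_amount"].cumsum()
print(stacked)
```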

Consumption Report#

The solution has a Resource Consumption Report dashboard, which helps you analyze which project/user is consuming resources, and on which infrastructure.

Sample dashboard for capacity planning.

The dashboard has four tabs:

  • General Analysis: A general overview of your consumption, with metrics, the distribution of resource consumption, and the most frequent jobs.

  • Instant Usage Investigation: Helps with capacity planning. You can investigate which project, user, or recipe generated a CPU and memory peak. Thanks to the filter section, you can narrow the usage down by date, execution location, project key, and user ID.

  • Cumulative resource usage over time: Focuses on which jobs consume the most resources over time, so you can spot projects that should be optimized and diminish your CO2 impact or financial cost.

  • SQL Details: A more specific tab that analyzes the SQL execution time across all your projects. It displays the daily SQL execution time per project and the distribution of the execution over the available SQL connections.

Dashboard filters are available on each tab to re-generate the charts on a specific subset of data (e.g., by user, date range, or execution type).

Build Your Own Dashboard#

The solution’s Project Setup allows you to build your own dashboard in just a few clicks by connecting your logs folder.

Select your connection leading to the location of the logs, reconfigure the connections in the Flow to match the ones selected, configure their path within the connection, and build your Resource Consumption Report dashboard.

Build your own dashboard through the Project Setup

What’s next?#

As a next step, you might want to use this solution to run some daily usage monitoring. If you wish to do so, you’ll need to make a few modifications to the project to improve its performance:

  1. Activate the time trigger on the Build scenario. (By default, the Build scenario rebuilds the Flow on demand and needs to be triggered by the user to refresh the visualizations with the latest logs.)

  2. Change all the dataset connections to an SQL connection (except for the input dataset).

  3. Change the execution engine to an SQL engine for all compatible recipes.

  4. Consider partitioning your dataset per day to make sure only the last slice of data is computed each day.

In a FinOps approach (this introduction to FinOps will give you an idea of the importance of controlling the cost of AI projects), the ideal scenario is to discuss with the other stakeholders (business, projects, IT, finance…) which information they need, and to progressively enrich the dashboards.