How to Leverage Compute Resource Usage Data
This article will guide you through a solution project that will help you keep track of the resource consumption of your Dataiku platform.
Introduction to Resource Consumption
As a beginner, you can use Dataiku as a standalone application that will handle the data storage, computing, and training of models. However, as your objectives grow, you will want to integrate the platform into a wider ecosystem.
At first, you will typically leverage external data sources (such as remote file storage, relational databases, data warehouses, and so on).
Soon, you will have more and more computation to run and will need more power. Because the local DSS engine is inherently limited, you will need to leverage external computation technologies. The main one that Dataiku uses is Spark on Kubernetes, also known as Elastic AI.
Additionally, in order to properly segregate teams and have dedicated UAT or Production servers, you will deploy several DSS instances.
This strategy characterizes the path to success with DSS. However, it comes at a cost. Monitoring the resources used by your DSS instances is a critical topic, especially when the number of projects and users is growing.
Fortunately, Dataiku DSS has an embedded capability to keep track of all those resource requests. This is called Compute Resource Usage (or CRU).
To get familiar with it, read the product documentation on this topic: Compute resource usage reporting
These compute resource usage logs enable the monitoring of resources down to the project and job level.
This solution can be installed on a Dataiku V10+ instance in one of two ways:
On your Dataiku instance, click + New Project > Business solutions, then search for Dataiku Resource Usage Monitoring.
Download the .zip project file and upload it directly to your Dataiku instance as a new project.
This solution is only intended for installed instances and is not compatible with Dataiku Online.
Overview of the Solution
The goal of this project is to help you leverage the compute resource usage logs to understand which user or project is consuming the most resources. Once you have completed this walkthrough, you should know enough to explore this solution yourself and replicate it without making any changes.
The project answers multiple needs with a focus on two main ones:
Reliability of Services / Capacity Planning - Understanding which project, user, or activity consumes resources during peak RAM/CPU usage. The aim is to optimize computation in order to deploy the platform on the right machine (CPU/RAM) or with the right sizing of Kubernetes clusters.
Cost Reduction (FinOps) / Green AI - Analyzing the duration of computations and the RAM consumption over time to monitor costs, whether financial or environmental.
As this is a Plug & Play solution, users only need to connect their compute_resource_usage_logs, which come either from an event server or directly from the audit logs of their instance. After some data preparation, three datasets defining the main categories are processed separately in their respective flow zones: Local Resources, SQL Resources, and Kubernetes Resources. They are then stacked into one dataset to analyze the Cumulative Usage over the total time of the processes. One last flow zone, Instant Usage Analysis, analyzes the minute-by-minute logs.
As a result of this data preparation, the Dashboards are generated for the user to better analyze which project/user is consuming resources, and on which infrastructure. Moreover, a Dataiku Application is also included in this solution in order to help you connect your logs and build the dashboard.
If you have read the documentation (you should really do it now if you have not: Compute resource usage reporting), you know that CRU logs are natively gathered through the audit centralization mechanism and stored in the audit logs that you can find in your datadir/run/audit/.
Those logs monitor the CPU and memory usage of Local and Kubernetes processes. For SQL computations, they provide the execution time (which might include waiting time). The smallest unit of time that you will see in this project is the minute.
For each process, a trace is produced:
compute-resource-usage-start - When a process starts.
compute-resource-usage-update - Every minute over the duration of the process.
compute-resource-usage-complete - When a process ends.
Although CRU logs are sent to the audit centralization mechanism, we strongly advise you to install an Event Server and configure your auditing target's destination to point to it (Administration > Settings > Auditing). This comes with several benefits:
Keep the history of the logs (the default audit directory is regularly flushed).
You can monitor all your nodes with the same event server and group the analysis into the same dashboard.
The Event Server can dispatch the logs into different folders according to their topic and date. This helps you load less data into the flow and partition your datasets easily.
Finally, to monitor the Kubernetes processes, you’ll need to enable periodic reporting in your Settings. Then, the platform will periodically (every minute) query each Kubernetes cluster for a list of pods and the current resource consumption of each pod, and will emit a dedicated CRU message containing the list of pods with their current resource consumption.
Now that this is clear, let’s deep dive into the flow.
In the Input Data zone, the first Python recipe makes sure that, whatever the origin of the CRU logs, we get the same schema and only keep CRU-related rows. This step is neither mandatory nor complex, but it is practical.
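To make the idea of keeping only CRU-related rows concrete, here is a minimal sketch in plain Python. It is not the recipe shipped with the solution: the field name `msgType` and the sample lines are assumptions made for illustration, so check the actual layout of your own logs before reusing it.

```python
import json

# The three CRU trace types described earlier in this article.
CRU_TRACE_TYPES = {
    "compute-resource-usage-start",
    "compute-resource-usage-update",
    "compute-resource-usage-complete",
}


def keep_cru_traces(lines, type_key="msgType"):
    """Parse JSON lines and keep only records whose type is a CRU trace.

    `type_key` is a hypothetical field name; adapt it to your log layout.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the whole load
        if isinstance(record, dict) and record.get(type_key) in CRU_TRACE_TYPES:
            kept.append(record)
    return kept


# Illustrative input: two CRU traces, one unrelated event, one broken line.
sample_lines = [
    '{"msgType": "compute-resource-usage-start", "id": "p1"}',
    '{"msgType": "some-other-audit-event", "id": "x"}',
    '{"msgType": "compute-resource-usage-complete", "id": "p1"}',
    'not json at all',
]
cru_rows = keep_cru_traces(sample_lines)
```

Skipping malformed lines silently is a design choice for this sketch; in a real flow you may prefer to count or log them.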
The next recipe is a Normalized recipe used to clean the data and precompute the date at day, hour, and minute levels. It is done early in the flow so that we have the possibility to use SQL engines later on.
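As an illustration of what precomputing the date at day, hour, and minute levels can look like, the sketch below truncates a timestamp to those three granularities. It assumes the timestamp is expressed in epoch milliseconds (UTC), which is an assumption about the log format, and the function name `time_buckets` is hypothetical.

```python
from datetime import datetime, timezone


def time_buckets(epoch_ms):
    """Return day, hour, and minute labels for an epoch-millisecond timestamp.

    Assumes `epoch_ms` is a UTC timestamp in milliseconds (an assumption).
    """
    ts = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return {
        "day": ts.strftime("%Y-%m-%d"),
        "hour": ts.strftime("%Y-%m-%d %H:00"),
        "minute": ts.strftime("%Y-%m-%d %H:%M"),
    }


# Build an example timestamp for 2023-01-15 10:37:42 UTC.
example_ms = int(
    datetime(2023, 1, 15, 10, 37, 42, tzinfo=timezone.utc).timestamp() * 1000
)
buckets = time_buckets(example_ms)
# buckets == {"day": "2023-01-15", "hour": "2023-01-15 10:00",
#             "minute": "2023-01-15 10:37"}
```

Storing these labels as plain columns early on means later aggregations only need simple group-by keys, which is what makes SQL engines usable downstream.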
The Split recipe then dispatches the remaining rows into three main categories:
one for all SQL processing (logs_sql_connection)
one for local processing (logs_local_process)
one for Kubernetes processing (logs_k8s)
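The three-way split above can be pictured with the following sketch. The output names come from the flow, but the `resource_type` field and its values are hypothetical placeholders chosen for illustration; the actual Split recipe relies on whatever columns your prepared logs expose.

```python
from collections import defaultdict

# Hypothetical mapping from a row's resource type to the dataset names used
# in the flow. The type values on the left are illustrative assumptions.
ROUTES = {
    "SQL_CONNECTION": "logs_sql_connection",
    "LOCAL_PROCESS": "logs_local_process",
    "KUBERNETES": "logs_k8s",
}


def split_by_category(rows, type_key="resource_type"):
    """Route each row to one of three buckets; unmatched rows are dropped."""
    buckets = defaultdict(list)
    for row in rows:
        target = ROUTES.get(row.get(type_key))
        if target is not None:
            buckets[target].append(row)
    return dict(buckets)


rows = [
    {"resource_type": "SQL_CONNECTION", "id": 1},
    {"resource_type": "LOCAL_PROCESS", "id": 2},
    {"resource_type": "KUBERNETES", "id": 3},
    {"resource_type": "UNUSED_REPORT", "id": 4},  # dropped, matches no route
]
split = split_by_category(rows)
```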
In the end, there is nothing very complex here, mostly data splitting.
One thing to note is that some rows from the original CRU data are dropped and never used. This is normal: they correspond to the start and end traces of Kubernetes task reports, which are of no use here.
Each of those categories will have a specific format of logs, therefore specific data preparation, in their dedicated flow zones. Let’s deep dive into each of them.
Regarding the Kubernetes Resources, most of the work is done in the Prepare recipe, which is organized into two groups of operations: Instant Usage of Resources and Cumulative Usage of Resources.
The Group recipe aggregates the resources used by each pod by minute and by process, whereas the Window recipe is used to take the latest log trace related to a process.
If you want to learn more about Kubernetes metrics and their usage, we recommend the Datadog blog entry: Collecting Metrics With Built-in Kubernetes Monitoring Tools.
The SQL Resources zone aims to measure the execution time of all SQL processes. A Filter recipe is applied to keep the completed processes, then a Compute recipe is added to get the execution time.
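The filter-then-compute logic can be sketched as follows. The column names `event`, `process_id`, `start_ms`, and `end_ms` are hypothetical and only serve the illustration; keep in mind that, as noted earlier, the measured time might include waiting time.

```python
def sql_execution_times(rows):
    """Return {process_id: duration in seconds} for completed SQL processes.

    Column names are hypothetical; only "complete" traces are kept.
    """
    durations = {}
    for row in rows:
        if row["event"] != "compute-resource-usage-complete":
            continue  # ignore start and update traces
        durations[row["process_id"]] = (row["end_ms"] - row["start_ms"]) / 1000
    return durations


sql_rows = [
    {"event": "compute-resource-usage-start", "process_id": "q1",
     "start_ms": 1_000, "end_ms": 1_000},
    {"event": "compute-resource-usage-complete", "process_id": "q1",
     "start_ms": 1_000, "end_ms": 4_500},
    {"event": "compute-resource-usage-complete", "process_id": "q2",
     "start_ms": 2_000, "end_ms": 2_250},
]
durations = sql_execution_times(sql_rows)
# durations == {"q1": 3.5, "q2": 0.25}
```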
The Local Resources zone is very similar to the Kubernetes zone, as it also mostly relies on some cleanup and computation done in the Prepare recipe.
The Group recipe is used to make sure we only keep one row per process per minute with the maximum resource usage for our minute-by-minute calculation.
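A minimal sketch of this "one row per process per minute, keeping the maximum" aggregation is shown below, using hypothetical column names (`process_id`, `minute`, `cpu`, `memory_mb`). In the solution itself, this is done visually by the Group recipe.

```python
def max_usage_per_minute(rows):
    """Keep the peak CPU and memory per (process_id, minute) pair."""
    peaks = {}
    for row in rows:
        key = (row["process_id"], row["minute"])
        current = peaks.get(key)
        if current is None:
            peaks[key] = {"cpu": row["cpu"], "memory_mb": row["memory_mb"]}
        else:
            # CPU and memory peaks are tracked independently of each other.
            current["cpu"] = max(current["cpu"], row["cpu"])
            current["memory_mb"] = max(current["memory_mb"], row["memory_mb"])
    return peaks


usage_rows = [
    {"process_id": "p1", "minute": "10:37", "cpu": 0.4, "memory_mb": 512},
    {"process_id": "p1", "minute": "10:37", "cpu": 0.9, "memory_mb": 480},
    {"process_id": "p1", "minute": "10:38", "cpu": 0.2, "memory_mb": 300},
]
peaks = max_usage_per_minute(usage_rows)
# peaks[("p1", "10:37")] == {"cpu": 0.9, "memory_mb": 512}
```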
The Window recipe takes the latest log trace related to a process. It is more efficient than filtering on "complete" jobs because it also captures aborted jobs, which do not have a "complete" trace.
Lastly, we compute the intervals between two consecutive runs of the same job to monitor process frequencies, on the branch leading to logs_local_process_frequencies_stats_prepared.
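The benefit of taking the latest trace per process can be seen in this small sketch, where one process is aborted and never emits a "complete" trace. The column names `process_id`, `timestamp_ms`, and `event` are hypothetical.

```python
def latest_trace_per_process(rows):
    """Return the most recent trace for each process, whatever its type."""
    latest = {}
    for row in rows:
        pid = row["process_id"]
        if pid not in latest or row["timestamp_ms"] > latest[pid]["timestamp_ms"]:
            latest[pid] = row
    return latest


traces = [
    {"process_id": "p1", "timestamp_ms": 0, "event": "start"},
    {"process_id": "p1", "timestamp_ms": 60_000, "event": "update"},
    {"process_id": "p1", "timestamp_ms": 90_000, "event": "complete"},
    # p2 is aborted: it never emits a "complete" trace.
    {"process_id": "p2", "timestamp_ms": 10_000, "event": "start"},
    {"process_id": "p2", "timestamp_ms": 70_000, "event": "update"},
]
latest = latest_trace_per_process(traces)
# p1 ends on its "complete" trace, p2 on its last "update" trace.
```

Filtering on "complete" alone would have dropped p2 entirely, whereas the latest-trace approach still reports its last known usage.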
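The interval computation between consecutive runs of the same job can be sketched as follows. How a "job" is identified here (a plain `job_key` string) and the use of start timestamps in milliseconds are assumptions for the illustration, not a description of the solution's exact columns.

```python
from collections import defaultdict


def job_intervals(runs):
    """Compute the gaps, in seconds, between consecutive runs of each job.

    `runs` is an iterable of (job_key, start_ms) pairs, in any order.
    """
    starts = defaultdict(list)
    for job_key, start_ms in runs:
        starts[job_key].append(start_ms)

    intervals = {}
    for job_key, times in starts.items():
        times.sort()
        # Pair each start time with the next one to get the gap between runs.
        intervals[job_key] = [(b - a) / 1000 for a, b in zip(times, times[1:])]
    return intervals


runs = [("A", 0), ("A", 60_000), ("B", 5_000), ("A", 180_000)]
gaps = job_intervals(runs)
# gaps == {"A": [60.0, 120.0], "B": []}
```

A job that ran only once yields an empty list, so frequency statistics should be computed only for jobs with at least two runs.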
The Instant Usage Analysis and Cumulative Usage Analysis flow zones feed two tabs of the Dashboard.
In the Instant Usage Analysis, we are gathering the analysis regarding the minute-by-minute logs. The flow zone gives us four datasets: Logs by day & hours, Logs by minute, Logs Top Memory usage, and Logs Top CPU usage. They allow the visualizations in the Instant Usage Analysis tab of the Dashboard to help with capacity planning.
As for the Cumulative Usage Analysis, we Stack the Local, Kubernetes, and SQL processes into one dataset to analyze the cumulative resource usage over the total time of the processes. This analysis helps with resource optimization and with reducing financial costs and/or CO2 impact.
The solution has a Resource Consumption Report dashboard which helps you analyze which project/user is consuming resources, and on which infrastructure.
The dashboard has four tabs:
General Analysis - This tab has a general overview of your consumption with metrics, distribution of resources consumption, and most frequent jobs.
Instant Usage - This tab helps with capacity planning. You can investigate which project, user, or recipe generated a CPU or memory peak. Thanks to the filter section, you can examine usage by date, execution location, project key, and user ID by filtering them in and out.
Cumulative Usage - This tab focuses on analyzing which jobs consume the most resources over time, spotting projects that should be optimized, and reducing your CO2 impact or financial cost.
SQL Details - This is a more specific dashboard that analyzes the SQL execution time from all your projects. It displays the daily SQL execution time per project and the distribution of the execution over the available SQL connections.
Dashboard Filters are available on each tab in order to regenerate the charts on a specific subset of data (e.g. by user, date range, or execution type).
The Dataiku Application allows you to build your own dashboard in just a few clicks by connecting your logs folder.
Select your connection leading to the location of the logs, reconfigure the connections in the flow to match the ones selected, configure their path within the connection, and build your Resource Consumption Report dashboard.
As a next step, you might want to use this solution to run daily usage monitoring. If so, you will need to make some modifications to the project to improve its performance.
Activate the time trigger on the Build scenario. (The Build scenario rebuilds the flow when you want to run an analysis and it needs to be triggered by the user to run the visualizations with the latest logs.)
Change all the database connections to an SQL one (except for the input dataset).
Change the execution engine to an SQL engine for all compatible recipes.
Consider partitioning your dataset per day to make sure only the last slice of data is computed each day.
In a FinOps approach (this introduction to FinOps will give you an idea of the importance of controlling the cost of AI projects), the ideal scenario is to discuss with other stakeholders (business, projects, IT, finance…) to review which information they need, and progressively enrich the dashboards.