Solution | Maintenance Performance and Planning#

Overview#

Business Case#

Equipment reliability is crucial for manufacturers because it underpins responsible, safe, and consistent manufacturing processes. Maintenance is key to mitigating unplanned downtime and ensuring safe, continuous operation. However, finding the right time for maintenance is a challenge, as it requires balancing operational and cost objectives.

In many industries, maintenance is either reactive or driven by excessive time-based preventative routines. Both options are costly, erode asset performance, and lower operational efficiency, costing manufacturers billions each year.

Time-based preventative maintenance and reactive firefighting represent two default strategies that no longer need to be the norm. Using AI and ML, manufacturers can refine their maintenance tactics by leveraging service history and equipment attributes. Techniques like survival analysis transform static, time-based maintenance schedules into tailored plans that reflect the true risk of mechanical failure for each asset.

With Dataiku’s Maintenance Performance and Planning solution, organizations can quickly leverage vast volumes of maintenance history and internal documents to draw insights on how to improve their maintenance performance. Thanks to common performance metrics like MTBF (Mean Time Between Failures), MTTR (Mean Time to Repair), and task paretos, reliability engineers can easily explore fleet behaviors using descriptive analytics. ML algorithms provide Remaining Useful Life (RUL) predictions based on maintenance history and recommend maintenance schedules per asset, allowing service managers to adjust strategies accordingly.

Dataiku’s LLM Mesh enables all business users to interpret model results and generate automatic reports. Whether for internal equipment maintenance or improving customer service, Dataiku’s Maintenance Performance and Planning solution empowers organizations to revisit their manufacturing strategies promptly and effectively.

Installation#

The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

Dataiku Cloud users should follow the instructions for installing solutions on cloud.

  1. The Cloud Launchpad will automatically meet the technical requirements listed below, and add the Solution to your Dataiku instance.

  2. Once the Solution has been added to your space, move ahead to Data Requirements.

Technical Requirements#

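To leverage this solution, you must meet the following requirements:

  • Python code environment: The packages below must be installed, with the version constraints shown.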

chromadb==0.5.5
cloudpickle==3.0.0
faiss-cpu==1.8.0.post1
Flask==3.0.3
Jinja2==3.1.4
langchain==0.1.11
lightgbm==4.5.0
lifelines==0.28.0
markdown2==2.5.0
plotly_calplot==0.1.20
pydantic<2
pinecone-client<3
pysqlite3-binary; platform_system == "Linux"
protobuf==3.20.3
qdrant_client==1.11.1
scikit-learn>=1.0,<1.1
scikit-survival==0.17.2
statsmodels==0.14.1
tiktoken==0.7.0
xgboost==2.1.1
  • LLM Mesh connection: This solution uses Dataiku’s LLM Mesh to interact with local or remote models. One connection needs to be configured to run the project.

Data Requirements#

The solution takes two datasets as input: maintenance_operations and equipment_information.

maintenance_operations

Contains logs of all maintenance operations, with the following columns:

  • equipment_id (string): ID of the machine

  • equipment_stop_time (date): Start of maintenance

  • equipment_restart_time (date): End of maintenance

  • is_planned (boolean): Indicates whether maintenance is planned

  • maintenance_operation (string): Maintenance category. Entries should be predefined types or part names — not raw text.

equipment_information

Contains static details for each piece of equipment. There should be one row per piece of equipment, and date columns should be pre-parsed, so that the dataset follows this data model:

  • equipment_id (string): Equipment’s unique ID

  • XXX (string / date / boolean / float): Include any additional columns essential for maintenance analysis. Feel free to add multiple columns as necessary.
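As an illustration of the data model above, here is a minimal sketch of how the two inputs could be checked before they are loaded. It is not part of the solution: the class and function names are hypothetical, and only the column names come from the lists above.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceOperation:
    """One row of maintenance_operations, using the documented column names."""
    equipment_id: str
    equipment_stop_time: datetime
    equipment_restart_time: datetime
    is_planned: bool
    maintenance_operation: str

    def __post_init__(self):
        # A maintenance operation cannot end before it starts.
        if self.equipment_restart_time < self.equipment_stop_time:
            raise ValueError(f"{self.equipment_id}: restart precedes stop")


def check_equipment_information(operations, equipment_rows):
    """Return (duplicated IDs, IDs present in the logs but missing here).

    equipment_rows is a list of dicts with at least an "equipment_id" key
    (hypothetical in-memory representation of equipment_information).
    """
    ids = [row["equipment_id"] for row in equipment_rows]
    duplicated = sorted(i for i, n in Counter(ids).items() if n > 1)
    missing = sorted({op.equipment_id for op in operations} - set(ids))
    return duplicated, missing
```

Both returned lists should be empty for inputs that respect the "one row per equipment" rule and cover every machine that appears in the maintenance logs.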

Optionally, you can drag and drop documents (text files, PDFs, or images containing text) into the context_documents folder. These documents will be used by the various LLMs to help explain your statistical results.

Workflow Overview#

You can follow along with the solution in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow zones.

The project has the following high-level steps:

  1. Select your analysis parameters using Project Setup.

  2. Prepare the data for survival analysis and compute statistical indicators.

  3. Train the survival model.

  4. Analyze the results using Dataiku’s LLM Mesh and display the result in a dashboard.

Walkthrough#

Note

In addition to reading this document, we recommend reading the project wiki before you begin. It gives a deeper technical understanding of how this Solution was built and explains Solution-specific vocabulary in more detail.

Define your maintenance goals via the Project Setup#

To begin, configure the solution from the project landing page using Project Setup. This opens the configuration page, where you can set your analysis parameters.

Dataiku screenshot of the Project Setup Application for Maintenance Performance and Planning

Here you can configure the parameters that guide your maintenance strategy. First, the End Date of the Analysis specifies the latest date for which data is available. The solution analyzes data up to that date.

Next, the Target Planned Maintenance Ratio sets the percentage of maintenance that should be planned, helping you optimize the balance between preventive and corrective maintenance. This value acts as a KPI against which you can compare your actual maintenance performance over time.
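The comparison between actual and target ratios can be sketched as follows. This is a minimal example that assumes the ratio is counted per operation (rather than weighted by downtime, which the solution may do differently). The function names are hypothetical.

```python
def planned_maintenance_ratio(is_planned_flags):
    """Share of maintenance operations flagged as planned (between 0 and 1)."""
    flags = list(is_planned_flags)
    if not flags:
        raise ValueError("no maintenance operations to evaluate")
    return sum(1 for flag in flags if flag) / len(flags)


def gap_to_target(actual_ratio, target_ratio):
    """Positive when the target is exceeded, negative when it is missed."""
    return actual_ratio - target_ratio
```

For example, three planned operations out of four give a ratio of 0.75, which falls short of a 0.8 target by 0.05.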

Finally, the Maximum Planned Interval defines the longest period within which maintenance can be scheduled. The planning will automatically adjust to respect this maximum interval.

Once these parameters are set, click Build to generate a tailored analysis. The dashboard will provide insights based on the data and constraints you defined.

Censor Events and Uptime: Getting data ready for Survival Analysis#

In the context of survival analysis, censored values represent instances where the precise time to event (in this case, maintenance or failure of equipment) is not known. This could occur when a machine is still functional at the end of the observation period or when a machine is taken offline for planned maintenance. Because the machine didn’t fail, these instances are treated as censored data, indicating that the actual time to failure is potentially longer than what is observed.

To handle these censored values effectively, survival analysis works on periods rather than events: the models compare the lengths of the periods and try to find patterns in the data that account for the differences.
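To make the move from events to periods concrete, here is a minimal sketch for a single piece of equipment. It assumes that an uptime period runs from one restart to the next stop, that a period ended by planned maintenance is censored, and that the last period is censored at the analysis end date. The function name and output keys are hypothetical, and the period before the first logged operation is simply ignored here.

```python
from datetime import datetime

SECONDS_PER_DAY = 86400


def build_uptime_periods(operations, analysis_end):
    """Turn the maintenance events of ONE equipment into uptime periods.

    operations: list of dicts with equipment_stop_time,
    equipment_restart_time (datetime) and is_planned (bool).
    """
    ops = sorted(operations, key=lambda op: op["equipment_stop_time"])
    periods = []
    # Each period starts at a restart and ends at the following stop.
    for previous, current in zip(ops, ops[1:]):
        start = previous["equipment_restart_time"]
        end = current["equipment_stop_time"]
        periods.append({
            "start": start,
            "duration_days": (end - start).total_seconds() / SECONDS_PER_DAY,
            # Only an unplanned stop counts as an observed failure.
            "event_observed": not current["is_planned"],
        })
    # The machine is still running at the end of the observation window:
    # that last period is censored.
    if ops and ops[-1]["equipment_restart_time"] < analysis_end:
        start = ops[-1]["equipment_restart_time"]
        periods.append({
            "start": start,
            "duration_days": (analysis_end - start).total_seconds() / SECONDS_PER_DAY,
            "event_observed": False,
        })
    return periods
```

Each resulting row carries a duration and an event flag, which is the shape survival models generally expect.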

Dataiku screenshot of the Flow zone dedicated to turning events data into periods for the Survival Analysis.

The analysis adopts two perspectives:

  1. Analysis by equipment: This Flow zone focuses on the time to maintenance for each piece of equipment. It computes uptime and handles boundary events to generate a comprehensive survival probability curve for each machine.

  2. Analysis by equipment & maintenance operation: This zone emphasizes the time to maintenance for each maintenance operation. It incorporates uptime and boundary events to assess the likelihood of each maintenance operation being required over time.
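A survival probability curve can be estimated from such periods in several ways. The sketch below uses the Kaplan-Meier estimator, a common non-parametric choice. This is an assumption made for illustration and not necessarily the method used inside the Flow zones. The function name is hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each observed failure time.

    durations: uptime lengths; events: True for an observed failure,
    False for a censored period.
    """
    pairs = sorted(zip(durations, events), key=lambda pair: pair[0])
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        failures = 0
        leaving = 0
        # Group all periods that end at the same time t.
        while i < len(pairs) and pairs[i][0] == t:
            failures += 1 if pairs[i][1] else 0
            leaving += 1
            i += 1
        if failures:
            survival *= 1 - failures / at_risk
            curve.append((t, survival))
        # Failed and censored periods both leave the risk set after t.
        at_risk -= leaving
    return curve
```

Running the same function on periods grouped by equipment, or by equipment and maintenance operation, would give one curve per group, mirroring the two perspectives listed above.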

Anticipating Machine Failure: Training a Survival Analysis Model#

Survival analysis, a statistical technique used to evaluate time-to-event data, is particularly pertinent in analyzing maintenance performance. It deals effectively with censored data, a common occurrence in maintenance scenarios, such as when a machine has not yet failed or is taken offline for planned maintenance. Treating these instances as censored, rather than discarding them, lets the model use the partial information they carry.

Moreover, survival analysis generates a survival probability curve that offers a detailed understanding of equipment longevity. Rather than merely predicting a single failure point, this curve allows for risk assessment over a duration, enabling more effective planning for maintenance schedules, resource allocation, and managing spare parts inventory.
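One way to turn a survival curve into a schedule is to pick the time at which the survival probability first drops below an acceptable level, capped by the Maximum Planned Interval from Project Setup. The sketch below illustrates that idea only: the `min_survival` threshold and the function name are hypothetical, and the solution may derive its recommendations differently.

```python
def recommend_interval(curve, min_survival, max_interval):
    """Pick a maintenance interval from a step-shaped survival curve.

    curve: list of (time, survival_probability) sorted by time.
    min_survival: lowest acceptable survival probability (hypothetical).
    max_interval: upper bound on the interval, in the same time unit.
    """
    for time, survival in curve:
        if survival < min_survival:
            # Survival stays acceptable until this time, so plan no later.
            return min(time, max_interval)
    # The curve never drops below the threshold: fall back to the cap.
    return max_interval
```

With a curve of `[(30, 0.95), (60, 0.85), (90, 0.6)]` and a threshold of 0.8, the interval is 90 when the cap is 120, and 75 when the cap is 75.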

The training of the survival analysis model involves several key steps:

  1. The uptime dataset is joined with the equipment information dataset, integrating relevant equipment data for the analysis.

  2. A ‘prepare’ phase calculates the number of days elapsed between each date and the machine’s restart time, effectively converting date values into duration metrics for modeling. Unneeded columns are also removed at this stage.

  3. Visual ML is used on the periods_with_covariates_prepared dataset to build a Cox proportional hazards model, which is then deployed as a recipe in the Flow.
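The date-to-duration conversion in step 2 can be sketched as follows. This is a minimal illustration on a single row held as a dict. The column names `period_start` and `installation_date`, the `days_since_` prefix, and the function name are hypothetical.

```python
from datetime import date


def dates_to_elapsed_days(row, reference_col, date_cols):
    """Replace each date column with the days elapsed up to the reference date.

    row: dict holding one period with its equipment attributes.
    reference_col: column holding the restart date of the period.
    date_cols: date columns to convert into durations.
    """
    reference = row[reference_col]
    # Drop the raw date columns and keep everything else.
    prepared = {key: value for key, value in row.items() if key not in date_cols}
    for col in date_cols:
        prepared[f"days_since_{col}"] = (reference - row[col]).days
    return prepared
```

A machine installed on 2024-01-01 whose period starts on 2024-03-01 gets a `days_since_installation_date` value of 60, giving the model a numeric age instead of a raw date.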

These steps ensure the model is adequately trained to leverage all maintenance logs and equipment data, thereby improving the overall reliability and precision of the solution.

Dataiku screenshot of the Flow zone dedicated to training the Survival Analysis model using the Visual ML lab feature of Dataiku.

Design smarter and more effective maintenance strategies#

The pre-built Maintenance Operations Analysis and Predictions dashboard packaged with this solution is designed to help reliability engineers manage the maintenance of their equipment fleet. The dashboard is divided into three pages.

Dataiku screenshot of a proposed maintenance schedule available in the General Overview page.

The General Overview page serves as an initial summary and a strategic tool for managing and optimizing maintenance operations. It provides a high-level view of the key performance indicators (KPIs), enabling users to quickly gauge equipment reliability, issue resolution efficiency, and overall equipment availability. By tracking maintenance trends and leveraging predictive models to forecast future maintenance schedules, it facilitates data-driven decision-making.
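The KPIs mentioned here are typically built on MTBF and MTTR. The sketch below uses their usual textbook definitions. The exact formulas used by the dashboard are not stated in this document, so treat the function names and the availability formula as assumptions.

```python
def mean_time_between_failures(uptime_hours, failure_count):
    """MTBF: total operating time divided by the number of failures."""
    if failure_count <= 0:
        raise ValueError("at least one failure is needed to compute MTBF")
    return sum(uptime_hours) / failure_count


def mean_time_to_repair(repair_hours):
    """MTTR: average duration of the repairs."""
    repairs = list(repair_hours)
    if not repairs:
        raise ValueError("at least one repair is needed to compute MTTR")
    return sum(repairs) / len(repairs)


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability derived from MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

For example, 600 hours of uptime over three failures gives an MTBF of 200 hours. With an MTTR of 4 hours, availability is 200 / 204, or about 98%.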

Dataiku screenshot of charts summarizing the descriptive statistical analysis computed on our fleet.

The Maintenance Operations Analysis page provides a comprehensive analysis of maintenance operations and equipment performance, helping users to uncover detailed insights and make informed decisions. It offers a platform to investigate individual equipment performance, identify failure-prone equipment, and understand the dynamics of various maintenance operations.

By helping users pinpoint patterns and anticipate potential problems, it enables the formulation of targeted maintenance strategies. This page is particularly useful for those seeking to enhance operational efficiency, reduce unscheduled downtime, and extend overall equipment lifespan.

Dataiku screenshot exposing the influence of different attributes on unplanned maintenance.

The Maintenance Performance Determinants page helps you understand the impact of specific factors, or covariates, on the likelihood of equipment failure. It uses a Cox proportional hazards model to compute risk multipliers, which indicate how much a covariate increases or decreases the risk of failure.

For categorical covariates, each category has its own risk multiplier compared to a baseline category, while for numerical covariates, the risk multiplier quantifies how a one-unit increase or decrease in the covariate alters the hazard, assuming other factors remain constant.
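In standard textbook notation (the symbols below are not taken from the project), the Cox proportional hazards model and the risk multiplier of a numerical covariate can be written as:

```latex
\begin{align*}
  h(t \mid x) &= h_0(t)\,\exp\left(\beta_1 x_1 + \dots + \beta_p x_p\right) \\
  \frac{h(t \mid x_1, \dots, x_j + 1, \dots, x_p)}
       {h(t \mid x_1, \dots, x_j, \dots, x_p)} &= e^{\beta_j}
\end{align*}
```

Here $h_0(t)$ is the baseline hazard, $x_1, \dots, x_p$ are the covariates, and $e^{\beta_j}$ is the risk multiplier of covariate $j$. For a categorical covariate, each category gets its own coefficient, and $e^{\beta}$ compares that category with the baseline category. As a worked example, a coefficient of $\beta_j = 0.2$ gives $e^{0.2} \approx 1.22$, that is, a hazard about 22% higher for each one-unit increase, with all other covariates held constant.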

The solution uses a combination of LLM calls to automatically interpret these risk multipliers, identify possible root causes in your contextual documents, and generate an automatic report that all stakeholders can benefit from.

This page helps identify and understand the factors that significantly influence equipment failure. With the help of Dataiku’s LLM Mesh, decision-makers can leverage their data and documents to identify the key variables affecting equipment longevity, thereby enhancing maintenance strategies and extending equipment life.

Working with LLMs#

Working with LLMs is an opportunity but requires specific attention. Please be mindful of cost, privacy, and regulatory concerns. Test on a small sample of data first to avoid unexpected behavior and limit the cost of iteration. Prompts might need to be adapted to your data or to the model used. Lastly, a human-in-the-loop process is recommended before taking any action based on results that rely on LLMs (directly or indirectly).

Reproducing these Processes With Minimal Effort For Your Own Data#

The intent of this solution is to show how Dataiku can be used to reduce unplanned downtime and optimize maintenance plans for your equipment. A single solution that informs the decisions of several teams across an organization makes it possible to implement strategies that mitigate unplanned downtime, ensure safe, continuous operations, and improve service to customers.

We have provided several suggestions on how to use your operations and equipment data to improve and optimize maintenance plans, but ultimately, the “best” approach will depend on your specific needs and your data. If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.