Solution | Process Mining#

Overview#

Business case#

Process optimization to reduce costs and improve efficiency is a perennial priority for companies. During economic periods in which cost pressures are especially pressing, a focus on process optimization becomes all the more critical to ensure continuing business success and resilience.

The ever-increasing use of technology systems to manage key processes provides previously inaccessible opportunities for process analysis and optimization. These systems generate timestamped workflow logs as a byproduct of their primary task (for example, case management, process execution).

This in turn enables a shift from time-consuming and potentially erratic process evaluation techniques (for example, spot checks, time-and-motion studies) to modern, comprehensive, rapid, and statistically driven analytics via process mining. By leveraging the timestamps recorded at each stage of a process flow, process mining instantly creates a visual and statistical representation of any process, allowing teams to immediately undertake effective reviews of:

  • Process conformance

  • Root cause analysis

  • Target optimization

  • Bottleneck identification

A survey of CFOs revealed that 93% haven’t yet used process mining to map business processes. With the ever-increasing digitization of business models, the use of process mining to identify root causes, take effective action, and monitor impact opens a significant field of untapped opportunities for companies.

Installation#

  1. From the Design homepage of a Dataiku instance connected to the internet, click + Dataiku Solutions.

  2. Search for and select Process Mining.

  3. If needed, change the folder into which the Solution will be installed, and click Install.

  4. Follow the modal to either install the technical prerequisites below or request an admin to do it for you.

Note

Alternatively, download the Solution’s .zip project file, and import it to your Dataiku instance as a new project.

Technical requirements#

To leverage this Solution, you must meet the following requirements:

  • Have access to a Dataiku 13.0+ instance.

  • A code environment named solution_process-mining running Python 3.8 or later, with pandas 1.3 and the following required packages:

    cairosvg==2.5.2
    dash==2.9.3
    dash_bootstrap_components>=1.0
    dash_daq>=0.5
    dash_interactive_graphviz>=0.3
    dash_table==5.0
    Flask-Session==0.4.1
    graphviz==0.17
    pydot==1.4.2
    
  • The open-source graph visualization software Graphviz must also be installed on the system that runs Dataiku (a quick sanity check is sketched below).
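
As a quick sanity check of the code environment, the following minimal sketch (an illustration, not part of the Solution) imports the key packages and asks the graphviz Python bindings to render a trivial graph, which fails if the Graphviz dot executable isn't available on the system:

    # Minimal sanity check for the solution_process-mining code environment.
    # Not part of the Solution itself; run it in a notebook using that environment.
    import pandas as pd
    import graphviz
    import dash

    print("pandas:", pd.__version__)
    print("dash:", dash.__version__)

    # Rendering a trivial graph requires the Graphviz "dot" executable on the PATH;
    # an ExecutableNotFound error here means Graphviz itself is missing.
    g = graphviz.Digraph()
    g.edge("Start", "End")
    g.pipe(format="svg")
    print("Graphviz rendering OK")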

Data requirements#

We initially built the Flow using real-world loan application event logs contained in an XES file. We sourced the data from the BPIC Challenge 2012; it represents a loan application process within a Dutch financial institution. This data comes pre-loaded in every instantiation of the Dataiku application packaged in the Solution, but it can be overwritten with your own event logs. The Solution accepts XES or CSV files for parsing. Logs taken as input for the process mining analysis must contain the following columns (an illustrative event log is sketched after the tables below):

  • Case ID: Indicates the unique identifier of a trace.

  • Activity: Indicates a step in the process. Depending on the use case, you can interpret it as an action or a state, or a mix of both.

  • Timestamp: Indicates the timestamp for the activity. It could be the timestamp of an event or the start time/end time of an activity.

You can add optional columns to enrich the analysis, including:

  • End Timestamp: Indicates the end timestamp for an activity, which must be greater than or equal to the timestamp above. The current implementation doesn't support concurrent executions.

  • Sorting: Provides a sorting column for your activities. When multiple activities of the same case share a timestamp, it's impossible to determine their order. If a sorting column isn't provided, such cases will be dropped.

  • Resource: Indicates the person or cost center that executed the action. You could define it at a case level or at an activity level.

  • Numerical Attribute: Provides any external numerical information about the case. Examples of numerical attributes are the claim amount for a claim process, a loan amount for a loan application process, or an invoice amount for an accounts payable process.

  • Categorical Attribute: Provides any external categorical information about the case. Examples of categorical attributes are a type of claim for a claim process, a type of loan for a loan application process, or a type of supplier for an accounts payable process.
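
To make the data model concrete, here is a minimal, hypothetical event log covering the mandatory columns plus a few optional ones. The column names and values are illustrative only; your own columns can be named differently, since you map them in the Dataiku app.

    # Hypothetical minimal event log illustrating the expected data model.
    # Column names are examples only; you identify your own columns in the Dataiku app.
    import pandas as pd

    event_log = pd.DataFrame(
        [
            # case_id, activity,              timestamp,             resource, loan_amount
            ("C-001", "Application submitted", "2023-01-02 09:15:00", "alice", 12000),
            ("C-001", "Application reviewed",  "2023-01-03 14:30:00", "bob",   12000),
            ("C-001", "Offer sent",            "2023-01-05 10:05:00", "bob",   12000),
            ("C-002", "Application submitted", "2023-01-02 11:40:00", "carol",  5000),
            ("C-002", "Application declined",  "2023-01-04 16:20:00", "carol",  5000),
        ],
        columns=["case_id", "activity", "timestamp", "resource", "loan_amount"],
    )
    event_log["timestamp"] = pd.to_datetime(event_log["timestamp"])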

Workflow overview#

You can follow along with the Solution in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow zones.

The project has the following high-level steps:

  1. Input audit logs and parse the data.

  2. Analyze data and pre-compute statistics.

  3. Interactively explore processes and create insights with a visual pre-built tool.

  4. Run conformance checks and explore individual traces.

Walkthrough#

Note

In addition to reading this document, it’s recommended to read the wiki of the project before beginning to get a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.

Input and analyze audit logs#

After installation, you will find the Process Mining application under the Applications section of the home page of your Dataiku instance.

To begin, you will need to create a new instance of the Process Mining application. The project comes with sample data. You can replace it with your data, assuming that it adopts the data model described above. You can do this in one of two ways:

  1. Upload data directly from your filesystem in the first section of the Dataiku app.

  2. Connect to your database of choice by selecting an existing connection.

Once uploaded, click Reconfigure to trigger a scenario that will build the workflow_parsed dataset and (optionally) switch all datasets in the Flow to the selected connection. Once this scenario completes, refresh the page to update the column names for identification.

Dataiku screenshot of the Process Mining Dataiku application.

Column identification allows you to link the three mandatory columns to the appropriate columns in the input dataset. Then choose either to include all other columns as attributes or to select attribute columns manually. For a better user experience, select only the columns you expect to be most valuable; selecting a large number of columns here will clutter the visual interface. It's possible to select the columns used for case and activity as additional attributes, which allows you to use them as filters later.

With columns correctly identified and parameters input, you can build the Flow. This will run all recipes in the Flow to output the final datasets needed to start the webapp. Once the Flow finishes building, you can access the Process Mining webapp by opening the dashboard directly from the Dataiku app. You can always rebuild the Flow with new parameters from the Dataiku app.

Once you have defined a reference process, you can return to the Dataiku app to run a conformance check report to appear in the dashboard. You just need to select a saved reference process from the dropdown and click Run.

From logs to process discovery: Using the webapp#

Process mining often starts with large log files that, at first glance, can seem inextricably jumbled and too complex to parse. Process discovery is the task of making sense of this data. You achieve this by using manual filters and selections to discriminate signal from noise, alongside algorithms that build a synthetic representation of the process.

The Process Discovery tab is the entry point to the webapp. It consists of a visualization screen displaying the process and a menu of filters and selectors that configure the visualization. You can find more details on each selector and filter in the Solution wiki. At a high level, the graph represents an aggregation of all the traces that match the current filters. Each node represents an activity with its name written inside, except for two special activities:

  • Start: All nodes linked to this activity are starting activities.

  • End: All nodes linked to this activity are ending activities.

Activities are also color-coded according to their frequency in the process graph. A legend at the top of the graph explains the color-coding. In the top-left corner, additional information is available about the traces that are displayed.

The first number is the number of traces currently displayed on the screen, that is, the traces remaining after all filters (including the variants filter) are applied. The second number is the total number of traces in the data, before any filtering happens.

Nodes are linked together with directed arrows. Each arrow is labeled with either the number of transitions from the source node to the target or the average time spent on the transition; you can toggle this label between frequency and time in the top-right corner. Clicking on a node or edge opens a pop-up window with details and statistics about it.

Dataiku screenshot of the process discovery available in the webapp.
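
Conceptually, each edge in this graph is a directly-follows relation aggregated across traces. The Solution computes this inside its own Flow recipes; purely as an illustration of the idea (reusing the hypothetical event_log DataFrame from the data requirements section above), a simplified standalone sketch could look like this:

    # Simplified sketch of a directly-follows graph: count transitions between
    # consecutive activities of each case. Illustrative only; the Solution's own
    # recipes perform the real computation.
    log = event_log.sort_values(["case_id", "timestamp"])
    log["next_activity"] = log.groupby("case_id")["activity"].shift(-1)

    # Frequency of each (source, target) transition across all traces.
    transitions = (
        log.dropna(subset=["next_activity"])
           .groupby(["activity", "next_activity"])
           .size()
           .reset_index(name="count")
    )
    print(transitions)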

Within the menu of the webapp, you have access to a variety of selectors that always select full traces, never only parts of them. The webapp comes with default preset selectors for every analysis, but you can include optional selectors by selecting additional columns from the workflow dataset through the Dataiku application.

You can also select the maximum number of variants to visualize. Note that plotting too many variants will fail, as the graph becomes too large.

The star icon on the top right lets you define a reference process. To export a visualization outside the webapp, you can save it using the top-right Save button.

Assessing process fit to real-world data: Conformance checks#

Conformance checking is the task of comparing a set of traces with a predefined process. It creates insights into how well real-world data conforms to a user-defined process and can also continuously monitor how new traces fit the process. Conformance checking can take multiple forms, depending on the format of the saved process and business expectations.

Once you have saved a process in the Process Discovery tab, you can switch to the Conformance Checks tab. Here you’ll be able to click Run Conformance Checks, and apply the conformance calculations to the existing traces.

Doing so will run the saved process through the Conformance Checks Flow zone, which computes the conformance checks on the traces via a Python recipe. Back in the webapp, once the conformance checks have been completed, an aggregated conformance score will be plotted on the right side.

This shows the average conformance of the traces selected by the left-hand selectors relative to the reference process (as described above the graph). You can adjust the time granularity of the graph (monthly, weekly, daily). In production, you could run these checks daily or in real time to monitor how well incoming data conforms to a set process.

Dataiku screenshot of the conformance checks in the webapp.
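
The exact scoring logic lives in the Solution's Python recipe and is documented in the project wiki. As a rough, hypothetical illustration of the general idea (not the Solution's algorithm), a naive conformance score for a single trace could be the share of its transitions that exist in the reference process:

    # Naive conformance illustration: share of a trace's transitions that are
    # allowed by the reference process. Hypothetical; the Solution's recipe
    # implements its own, more complete logic.
    reference_edges = {
        ("Application submitted", "Application reviewed"),
        ("Application reviewed", "Offer sent"),
    }

    def naive_conformance(activities):
        """activities: ordered list of activity names for one trace."""
        transitions = list(zip(activities, activities[1:]))
        if not transitions:
            return 1.0
        allowed = sum(1 for t in transitions if t in reference_edges)
        return allowed / len(transitions)

    print(naive_conformance(["Application submitted", "Application reviewed", "Offer sent"]))  # 1.0
    print(naive_conformance(["Application submitted", "Application declined"]))                # 0.0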

Exploring individual traces#

Within the Conformance view of the webapp, you can also explore individual traces (assuming you have run conformance checks). Selectors on the left-hand side determine which traces are displayed. The resulting table shows each trace's case ID, conformance score from the conformance checks, start time, and aggregated attributes for all other optional columns.

You can sort the table on each column. When you click on a row from the table, a graph of the specific trace process will appear below with the same functionalities as the Process Discovery visualization.

Dataiku screenshot of the webapp view allowing for individual trace exploration.
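
The per-trace table is essentially a per-case aggregation of the event log. As a rough, hypothetical sketch of that shape (again reusing the event_log DataFrame from the data requirements example, not the Solution's actual recipe):

    # Hypothetical per-case summary, similar in shape to the trace table:
    # one row per case with start time and aggregated attributes.
    case_summary = (
        event_log.sort_values("timestamp")
                 .groupby("case_id")
                 .agg(
                     start_time=("timestamp", "min"),
                     n_events=("activity", "size"),
                     resources=("resource", lambda s: sorted(set(s))),
                     loan_amount=("loan_amount", "first"),
                 )
                 .reset_index()
    )
    print(case_summary)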

Understand processes with visual insights#

The Solution includes a dashboard containing three tabs:

  • Process Summary: Contains an initial descriptive analysis of the process logs/data.

  • Conformance: Provides a conformance report per selected reference process from the Dataiku app.

  • Webapp: Embeds the aforementioned webapp into a page of the dashboard to make sharing across the organization simple.

Dataiku screenshot of the process summary tab.

The Process Summary page depends on several datasets in the Flow to generate key metrics and charts which analyze the initial process data. Here you can find the number of traces in the data, the average time for process completion, descriptive analytics about the start and end activities, the frequency of activities, overall case performance, and more.

Dataiku screenshot of the Conformance Report tab of the dashboard.

The final section of the Dataiku app builds the Conformance page. It can generate a report for the reference process created and saved in the webapp. The report contains the conformance graph, showing the reference process with arrows indicating the actual flow of data through it.

Also, metrics and charts are generated to identify the number of cases that have been analyzed, the share of cases that conform to the reference process, the average conformance score, and the evolution of conformance metrics. You can change and regenerate this report each time new data comes in or the reference process changes.

Automation#

Four scenarios automate parts of the Flow's computation; they are triggered either through the Dataiku application or the webapp. You can create reporters to send messages to Teams, Slack, email, etc. to keep your full organization informed. You can also run these scenarios ad hoc as needed, as sketched below. You can find full details on the scenarios in the wiki.
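
For teams that prefer to trigger runs programmatically, a sketch like the following could launch one of the Solution's scenarios through the Dataiku Python API. The project key and scenario ID below are placeholders; use the identifiers visible in your own instance.

    # Hypothetical ad-hoc run of one of the Solution's scenarios via the Dataiku
    # Python API. Replace PROJECT_KEY and SCENARIO_ID with your own identifiers.
    import dataiku

    client = dataiku.api_client()
    project = client.get_project("PROJECT_KEY")      # placeholder
    scenario = project.get_scenario("SCENARIO_ID")   # placeholder

    # Blocks until the scenario finishes; raises an error if the run fails.
    scenario.run_and_wait()
    print("Scenario run completed")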

Process mining workstreams often integrate additional data science techniques, such as automated anomaly detection or cluster analysis, to enrich process investigations and uncover deeper insights. Dataiku’s platform is ideally positioned to allow your team to pursue these additional paths and integrate the results into this process mining Solution.

Reproducing these processes with minimal effort for your data#

The intent of this project is to enable operations, strategy, and transformation teams to understand how they can use Dataiku to create a visual map of their processes based on readily available process logs.

By creating a singular Solution that can benefit and influence the decisions of a variety of teams in a single organization, you can design smarter and more holistic strategies to deep-dive into specific processes, analyze outliers, and apply powerful statistical techniques to enable remediation and optimization efforts.

This documentation has provided several suggestions on how to derive value from this Solution. Ultimately, however, the “best” approach will depend on your specific needs and data. If you’re interested in adapting this project to the specific goals and needs of your organization, Dataiku offers roll-out and customization services on demand.