Concept | Workflow documentation in a wiki#

Your project workflow should be documented before deploying it to production. A well-documented workflow:

  • Facilitates reproducibility

  • Eases maintenance

  • Supports collaboration between team members

This article presents steps to document your workflow using a project wiki. Throughout, you’ll notice we’ve used the sample project Detect Credit Card Fraud. Using this sample project, we’ll walk through typical sections of a project wiki.

Project goals#

The first section of our wiki documents the project goals. The documentation helps stakeholders understand the purpose of the project.

As an example, our wiki contains information that answers the following:

  • What is the purpose of the project, including project goals?

  • Who will be using the project in production?

  • What problem does the project solve?

../../_images/project-goals-wiki.png

Code environment and plugins#

Next, we’ll document our project’s code environment and plugins to ensure that the development and production environments are identical.

../../_images/code-env-wiki.png

Code environment#

As described in Automation nodes, the code environment on an Automation node can be versioned, and each project bundle can be linked to a specific version of a code environment. For these reasons, we’ll want to document our code environment.

Our sample project has a single Python code environment. Our wiki includes the following information:

  • Environment name

  • Python version

  • Required packages
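As a minimal sketch of how these three items could be collected rather than typed by hand, the snippet below uses only the Python standard library to produce Markdown-style text that can be pasted into a wiki article. The environment name `fraud_py_env` and the package list are hypothetical placeholders, not values from the sample project, and the script must run inside the code environment being documented.

```python
import sys
from importlib import metadata


def describe_code_env(env_name, packages):
    """Return Markdown-style text describing the running Python environment.

    env_name is a label chosen by the author (hypothetical here);
    packages is the list of distribution names to record.
    """
    py_version = ".".join(str(part) for part in sys.version_info[:3])
    lines = [
        f"## Code environment: {env_name}",
        "",
        f"- Python version: {py_version}",
        "- Required packages:",
    ]
    for name in packages:
        try:
            # Look up the installed version of each distribution.
            lines.append(f"  - {name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            lines.append(f"  - {name} (not installed)")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical environment name and package list, for illustration only.
    print(describe_code_env("fraud_py_env", ["pandas", "scikit-learn"]))
```

Recording exact versions, not just package names, makes it easier to spot a mismatch between the development and production environments later.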

Plugins#

When we use a plugin anywhere in our project, we’ll need to document it to ensure the plugin is added to the production environment.

In our wiki, we’ve manually listed the plugins that we used to design our workflow. Our plugins were installed from the Dataiku plugin store. Because plugins can add datasets, recipes, processors, custom formula functions, and more, their use in a project is not always obvious, which makes documenting them essential.

Data sources#

Unexpected behavior can happen when the databases in our development and production environments have different schemas. Documentation of these independent versions of our databases can help prevent unexpected behavior.

Data source documentation should include the following:

  • Data source

  • Data availability

  • Data ownership

  • Schema

  • Column description

  • Data connection configuration between Dataiku and the database

../../_images/data-source-wiki.png
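To illustrate why recording the schema matters, here is a small sketch that compares a documented development schema against a production schema and reports the differences. The column names and types are hypothetical, and the schemas are plain dictionaries rather than anything read from Dataiku or a database.

```python
def diff_schemas(dev_schema, prod_schema):
    """Compare two {column_name: type} mappings and report the differences."""
    dev_cols, prod_cols = set(dev_schema), set(prod_schema)
    return {
        # Columns documented in development but absent from production.
        "missing_in_prod": sorted(dev_cols - prod_cols),
        # Columns present in production but absent from development.
        "missing_in_dev": sorted(prod_cols - dev_cols),
        # Columns present in both environments with different types.
        "type_mismatch": sorted(
            col for col in dev_cols & prod_cols
            if dev_schema[col] != prod_schema[col]
        ),
    }


# Hypothetical schemas, for illustration only.
dev = {"transaction_id": "string", "amount": "double", "purchase_date": "date"}
prod = {"transaction_id": "string", "amount": "string", "merchant_id": "string"}

report = diff_schemas(dev, prod)
for issue, columns in report.items():
    print(f"{issue}: {columns}")
```

An empty report suggests the two documented schemas agree. Any non-empty entry is worth noting in the wiki before deployment.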

Note

Descriptions can be handy. You can add descriptions throughout your project, including in the project’s homepage, the summary tab of a dataset, column details, and in the code of your custom recipes.

Data processing#

Designing a workflow takes time and involves many decisions, which can be lost or forgotten if they are not documented. Documentation of dataset preparation and computation provides the necessary transparency for maintenance and improvement of the workflow. The documentation could also be used to help reproduce or restore the workflow.

Our data processing section documents:

  • How each input dataset was prepared.

  • How each output dataset was computed.

../../_images/data-processing-wiki.png

ML modeling#

We make many decisions during the development of a machine learning model. We might iterate on a model’s design many times and make multiple design choices with each new iteration. We quickly forget the decisions behind each iteration and why each model version exists.

Documenting design decisions provides transparency in the MLOps process. We can take advantage of model documentation features in Dataiku to generate machine learning model documentation.

Model summary#

Our goal for our model summary documentation is to help stakeholders identify the following model information:

  • The dataset the model was trained on.

  • What the model does.

  • How the model was built, tuned, and validated, including which features were selected.

../../_images/ml-modeling-wiki.png

To document our model, we used the Model Document Generator to generate a Microsoft Word™ .docx file. We then attached the file to the wiki.

Note

To use the Model Document Generator, Dataiku must be set up to export images. For more information, visit Setting Up Dataiku Item Exports to PDF or Images.

Model behavior and monitoring#

Our goal for our model behavior documentation is to help stakeholders identify the following model information:

  • Which features have the most significant impact on the prediction?

  • How does the model behave with different inputs?

  • Was the model designed with responsible AI and fairness in mind?

If new data is significantly different from the data used to train the model, the model will likely no longer perform well. Therefore, stakeholders will also want to know how we plan to monitor model behavior, including model drift.

In addition, our documentation describes our plan for monitoring model behavior. This includes the following:

  • Model monitoring frequency

  • Expected performance drift (in metrics)

  • Expected prediction drift

We’ve also documented that our project uses a specific plugin to examine if new data waiting to be scored has diverged from the training data.

../../_images/model-behavior-wiki.png
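To make the idea of data drift concrete, the sketch below computes the Population Stability Index (PSI), one common way to quantify how far the distribution of a numeric feature in new data has moved from the training data. This is an illustrative stand-in and not a description of how the plugin mentioned above works. The bin count, the `eps` floor, and the sample values are assumptions, and both inputs must be non-empty.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a new sample (actual).

    Bins are equal-width over the range of the reference sample. Values
    outside that range fall into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor each share at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values, for illustration only.
train_amounts = [float(v) for v in range(100)]
new_amounts = [v + 50.0 for v in train_amounts]

print(population_stability_index(train_amounts, train_amounts))
print(population_stability_index(train_amounts, new_amounts))
```

Identical samples give a PSI of zero, and the value grows as the distributions diverge. Whatever metric and threshold the team adopts, the wiki is a good place to record them alongside the monitoring frequency.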

Dashboards#

You can document your dashboards. Our Dashboards section includes the following information:

  • Dashboard title and purpose

  • Steps to create the insights published to the dashboard

  • Whether or not dashboards are re-created in production

../../_images/dashboards-wiki.png

Scenarios#

Scenarios are the basis for production automation. We may wish to add the following information to the wiki:

  • A diagram of the Flow

  • Data quality rules and/or model metrics and checks, as applicable

  • Scenario settings and steps

  • Scenario trigger

  • Scenario reporter including the email template

We can also document if the scenario’s triggers are enabled, disabled, or left alone when activating a bundle on the Automation node.

Deployment#

Moving into production is an iterative process, and there are many reasons to document deployment. One is being able to roll back to a prior version. To do so, stakeholders will want to understand how the project bundle is deployed to the Automation node and how it is versioned.

For our sample use case, we’ve included the following deployment documentation:

  • Deployer infrastructure description

  • API Deployer

      • API services

      • Naming conventions

      • Versioning

  • Project Deployer

      • Project bundles

      • Naming conventions

      • Versioning

  • Project Version Control

      • Metadata, including information about the last change
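A documented naming and versioning convention is most useful when it can be checked mechanically. The sketch below assumes a purely hypothetical bundle ID convention of the form `<PROJECT_KEY>_v<major>.<minor>.<patch>`. Neither Dataiku nor the sample project prescribes it. The code parses bundle IDs and picks the latest one, which is the kind of lookup needed when deciding what to roll back to.

```python
import re

# Hypothetical convention: PROJECTKEY_v<major>.<minor>.<patch>
BUNDLE_ID = re.compile(
    r"^(?P<project>[A-Z][A-Z0-9_]*)_v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)


def parse_bundle_id(bundle_id):
    """Return (project_key, (major, minor, patch)), or None if the ID is invalid."""
    match = BUNDLE_ID.match(bundle_id)
    if match is None:
        return None
    version = tuple(int(match.group(k)) for k in ("major", "minor", "patch"))
    return match.group("project"), version


def latest_bundle(bundle_ids):
    """Return the valid bundle ID with the highest version, or None."""
    valid = [b for b in bundle_ids if parse_bundle_id(b) is not None]
    # Compare versions numerically so that 1.10.0 sorts after 1.9.0.
    return max(valid, key=lambda b: parse_bundle_id(b)[1], default=None)


# Hypothetical bundle IDs, for illustration only.
bundles = ["CREDIT_FRAUD_v1.9.0", "CREDIT_FRAUD_v1.10.0", "credit-fraud-final"]
print(latest_bundle(bundles))
```

Comparing versions as integer tuples rather than strings avoids the common mistake of ranking `1.9.0` above `1.10.0`.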

What’s next?#

To help ensure our project components are reproducible in production, we can maintain our wiki throughout the MLOps process. Documentation can help stakeholders overcome some of the challenges they are likely to face, including training data that can’t be reproduced, scenario failures, and model training failures.

Note

For further reading, visit the reference documentation on wikis.