Concept | Workflow documentation in a wiki#
Document your project workflow before deploying it to production. A well-documented workflow:
Facilitates reproducibility.
Eases maintenance.
Supports collaboration between team members.
This article presents steps to document your workflow using a project wiki. Throughout, we’ll use the sample project Detect Credit Card Fraud to walk through what could appear in a project wiki. Your own wikis may look very different!
Project goals#
The first section of our wiki documents the project goals. The documentation helps stakeholders understand the purpose of the project.
As an example, our wiki contains information that answers the following:
What is the purpose of the project, including project goals?
Who will be using the project in production?
What problem does the project solve?
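Wiki articles can be created programmatically as well as through the UI. Below is a minimal sketch, run from a Python notebook inside Dataiku, that seeds a goals article from a template. The project key and article name are hypothetical placeholders, and we assume create_article accepts an initial content string, as in recent versions of the Python client.

```python
import dataiku

# Run from a Python notebook inside the project. The project key and
# article name below are hypothetical placeholders.
client = dataiku.api_client()
project = client.get_project("DKU_CREDIT_CARD_FRAUD")

goals_template = """\
What is the purpose of the project, including project goals?
Who will be using the project in production?
What problem does the project solve?
"""

# We assume create_article accepts an initial content string, as in
# recent versions of the Dataiku Python client.
wiki = project.get_wiki()
wiki.create_article("Project goals", content=goals_template)
```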
Code environment and plugins#
Next, we’ll document our project’s code environment and plugins to ensure that the development and production environments are identical.
Code environment#
As described in Automation nodes, the code environment on an Automation node can be versioned, and each project bundle can be linked to a specific version of a code environment. For these reasons, we’ll want to document our code environment.
Our sample project has a single Python code environment. Our wiki includes the following information:
Environment name
Python version
Required packages
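Rather than copying these details by hand, we can pull them from the instance with the Python client. A minimal sketch, assuming a Python code environment named fraud-detection-py (hypothetical); the exact keys in the definition dict vary by Dataiku version.

```python
import dataiku

# Run from a Python notebook inside Dataiku. The environment name
# "fraud-detection-py" is a hypothetical placeholder.
client = dataiku.api_client()
code_env = client.get_code_env("PYTHON", "fraud-detection-py")
definition = code_env.get_definition()

# The exact keys vary by Dataiku version; "pythonInterpreter" and
# "specPackageList" are assumptions based on recent releases.
print("Environment name:", definition.get("envName"))
print("Python version:", definition.get("desc", {}).get("pythonInterpreter"))
print("Required packages:")
print(definition.get("specPackageList", ""))
```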
Plugins#
When we use a plugin anywhere in our project, we’ll need to document it to ensure the plugin is added to the production environment.
In our wiki, we’ve manually listed the plugins that we used to design our workflow. Our plugins were installed from the Dataiku plugin store. Plugins can add datasets, recipes, processors, custom formula functions, and more, so their use in a project is not always obvious and is therefore essential to document.
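One way to keep this list accurate is to generate it. A minimal sketch using the Python client’s list_plugins call; the exact fields in each entry can vary by version, and admin-level API access is typically required.

```python
import dataiku

# Listing installed plugins typically requires admin-level API access.
client = dataiku.api_client()
for plugin in client.list_plugins():
    # We assume each entry exposes "id" and "version" keys; the exact
    # shape of the dict can vary by Dataiku version.
    print(plugin.get("id"), plugin.get("version"))
```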
Data sources#
Unexpected behavior can occur when the databases in our development and production environments have different schemas. Documenting these independent versions of our databases helps prevent such issues.
Data source documentation should include the following:
Data source
Data availability
Data ownership
Schema
Column description
Data connection configuration between Dataiku and the database
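Much of this can be pulled straight from the dataset’s settings. A minimal sketch that prints each column’s name, type, and description for a hypothetical transactions dataset; the comment key is where the schema stores a column description when one has been entered.

```python
import dataiku

# "transactions" is a hypothetical dataset name in our sample project.
client = dataiku.api_client()
project = client.get_project("DKU_CREDIT_CARD_FRAUD")
dataset = project.get_dataset("transactions")

schema = dataset.get_schema()
for column in schema["columns"]:
    # Each column dict carries "name" and "type"; the "comment" key holds
    # the column description when one has been entered.
    print(column["name"], column["type"], column.get("comment", ""))
```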
Note
Descriptions can be handy. You can add descriptions throughout your project, including on the project homepage, in the summary tab of a dataset, in column details, and in the code of your custom recipes.
Data processing#
Workflow design is time-consuming and involves many decisions, and those decisions can be lost or forgotten if they are not documented. Documenting dataset preparation and computation provides the transparency needed to maintain and improve the workflow. The documentation can also help reproduce or restore the workflow.
Our data processing section documents:
How each input dataset was prepared.
How each output dataset was computed.
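A quick way to bootstrap this section is to enumerate the recipes in the Flow and then describe each one. A minimal sketch using the Python client; we assume each recipe summary can be read like a dict with name and type keys, as in recent versions.

```python
import dataiku

client = dataiku.api_client()
project = client.get_project("DKU_CREDIT_CARD_FRAUD")

# We assume each recipe summary can be read like a dict with "name" and
# "type" keys, as in recent versions of the Python client.
for recipe in project.list_recipes():
    print(recipe["name"], "-", recipe["type"])
```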
ML modeling#
We make many decisions during the development of a machine learning model. We might iterate on a model’s design many times and make multiple design choices with each new iteration. We quickly forget the decisions behind each iteration and why each model version exists.
Documenting design decisions provides transparency in the MLOps process. We can take advantage of model documentation features in Dataiku to generate machine learning model documentation.
Model summary#
Our goal for our model summary documentation is to help stakeholders identify the following model information:
The dataset the model was trained on.
What the model does.
How the model was built, tuned, and validated, including which features were selected.
To document our model, we used the Model Document Generator to generate a Microsoft Word™ .docx file. We then attached the file to the wiki.
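Attaching the file can also be scripted. A minimal sketch, assuming a wiki article named Model summary (hypothetical) and that upload_attachment is available on wiki articles, as in recent versions of the Python client.

```python
import dataiku

# "Model summary" is a hypothetical wiki article name, and the .docx path
# is a placeholder for the file produced by the Model Document Generator.
client = dataiku.api_client()
project = client.get_project("DKU_CREDIT_CARD_FRAUD")
article = project.get_wiki().get_article("Model summary")

with open("model_documentation.docx", "rb") as f:
    article.upload_attachment(f, "model_documentation.docx")
```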
Note
To use the Model Document Generator, Dataiku must be set up to export images. For more information, visit Setting Up Dataiku Item Exports to PDF or Images.
Model behavior and monitoring#
Our goal for our model behavior documentation is to help stakeholders identify the following model information:
Which features have the most significant impact on the prediction?
How does the model behave with different inputs?
Was the model designed with responsible AI and fairness in mind?
If new data is significantly different from the data used to train the model, the model will likely no longer perform well. Therefore, stakeholders will also want to know how we plan to monitor model behavior, including model drift.
In addition, our documentation describes our plan for monitoring model behavior, including the following:
Model monitoring frequency
Expected performance drift (in metrics)
Expected prediction drift
We’ve also documented that our project uses a specific plugin to examine whether new data waiting to be scored has diverged from the training data.
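The plugin’s internals aren’t shown here, but as a generic illustration of the idea, the sketch below computes a population stability index (PSI) between a training column and incoming data, one common way to quantify drift. This is not the plugin’s implementation, just a self-contained example.

```python
import numpy as np
import pandas as pd

def psi(train_col: pd.Series, new_col: pd.Series, bins: int = 10) -> float:
    """Population stability index between training and incoming data.

    PSI = sum((p_new - p_train) * ln(p_new / p_train)) over shared bins.
    Rough rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 major drift.
    """
    train = train_col.dropna().to_numpy()
    new = new_col.dropna().to_numpy()
    # Bin edges come from the training distribution so both datasets are
    # compared on the same grid.
    edges = np.histogram_bin_edges(train, bins=bins)
    p_train = np.histogram(train, bins=edges)[0] / len(train)
    p_new = np.histogram(new, bins=edges)[0] / len(new)
    # A small floor avoids division by zero in empty bins.
    eps = 1e-6
    p_train = np.clip(p_train, eps, None)
    p_new = np.clip(p_new, eps, None)
    return float(np.sum((p_new - p_train) * np.log(p_new / p_train)))

# Example with hypothetical DataFrames:
# drift = psi(train_df["amount"], scoring_df["amount"])
```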
Dashboards#
You can document your dashboards. Our Dashboards section includes the following information:
Dashboard title and purpose
Steps to create the insights published to the dashboard
Whether or not dashboards are re-created in production
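Dashboard titles, at least, can be pulled programmatically as a starting point for this section. A minimal sketch; we assume each dashboard summary exposes name and id fields.

```python
import dataiku

client = dataiku.api_client()
project = client.get_project("DKU_CREDIT_CARD_FRAUD")

# We assume each dashboard summary exposes "name" and "id" keys.
for dashboard in project.list_dashboards():
    print(dashboard["name"], "-", dashboard["id"])
```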
Scenarios#
Scenarios are the basis for production automation. We may wish to add the following information to the wiki:
A diagram of the Flow
Data quality rules and/or model metrics and checks, as applicable
Scenario settings and steps
Scenario trigger
Scenario reporter including the email template
We can also document whether the scenario’s triggers are enabled, disabled, or left unchanged when activating a bundle on the Automation node.
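Trigger configuration can be read from the scenario’s settings to keep the wiki accurate. A minimal sketch, assuming a hypothetical scenario ID and that the raw settings dict exposes a triggers list, as in recent versions of the Python client.

```python
import dataiku

# "SCORE_NEW_TRANSACTIONS" is a hypothetical scenario ID.
client = dataiku.api_client()
project = client.get_project("DKU_CREDIT_CARD_FRAUD")
scenario = project.get_scenario("SCORE_NEW_TRANSACTIONS")

# We assume the raw settings dict exposes a "triggers" list whose entries
# carry "type" and "active" keys, as in recent Dataiku versions.
settings = scenario.get_settings()
for trigger in settings.get_raw().get("triggers", []):
    print(trigger.get("type"), "- active:", trigger.get("active"))
```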
Deployment#
Moving into production is an iterative process, and there are many reasons to document deployment. One is the ability to roll back to a prior version: stakeholders will want to understand how the project bundle is deployed to the Automation node and how it is versioned.
For our sample use case, we’ve included the following deployment documentation:
Deployer infrastructure description
API Deployer
    API services
    Naming conventions
    Versioning
Project Deployer
    Project bundles
    Naming conventions
    Versioning
Project Version Control
    Metadata, including information about the last change
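Bundle creation itself can be scripted so the naming convention is enforced in code. A minimal sketch run against the Design node; the bundle ID doubles as the version label, the "v1.2.0" convention is only an example, and we assume publish_bundle is available to push to the Project Deployer, as in recent versions of the Python client.

```python
import dataiku

# Run on the Design node. The bundle ID doubles as the version label;
# the "v1.2.0" naming convention is only an example.
client = dataiku.api_client()
project = client.get_project("DKU_CREDIT_CARD_FRAUD")

bundle_id = "v1.2.0"
project.export_bundle(bundle_id)

# We assume publish_bundle is available to push the bundle to the
# Project Deployer, as in recent versions of the Python client.
project.publish_bundle(bundle_id)
```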
What’s next?#
To help ensure our project components are reproducible in production, we can maintain our wiki throughout the MLOps process. Documentation can help stakeholders overcome some of the challenges they are likely to face, including training data that can’t be reproduced, scenario failures, and model training failures.
Note
For further reading, visit the reference documentation on wikis.