Tutorial | Batch deployment#
Get started#
When you finish designing a project, it’s time to push it into production! In contrast to real-time scoring, this tutorial implements a batch processing production framework.
Objectives#
In this tutorial, you will:
Create a project bundle on the Design node.
Deploy the bundle on the Automation node using the Project Deployer.
Manage bundle versions between Design and Automation nodes.
Monitor the project’s status on the Deployer.
Prerequisites#
Dataiku 12.6 or later for importing the tutorial project.
An Automation node connected to the Design node.
Dataiku Cloud users can follow instructions for adding the Automation node extension.
Administrators of self-managed Dataiku instances should follow the reference documentation on Production deployments and bundles.
The Write project content permission on the project used in this tutorial to create a project bundle.
Intermediate knowledge of Dataiku (recommended courses in the Advanced Designer learning path or equivalent).
Create the project#
From the Dataiku Design homepage, click + New Project.
Select Learning projects.
Search for and select Batch Deployment.
Click Install.
From the project homepage, click Go to Flow (or press g + f).
Note
You can also download the starter project from this website and import it as a zip file.
Use case summary#
The project has three data sources:
| Dataset | Description |
| --- | --- |
| tx | Each row is a unique credit card transaction with information such as the card that was used and the merchant where the transaction was made. It also indicates whether the transaction was authorized or flagged for potential fraud. |
| merchants | Each row is a unique merchant with information such as the merchant’s location and category. |
| cards | Each row is a unique credit card ID with information such as the card’s activation month or the cardholder’s FICO score (a common measure of creditworthiness in the US). |
Run the failing scenario on the Design node#
The existing Data Refresh scenario found in the project automates the build of the tx_windows dataset. However, it’s currently failing because of a data quality rule on the tx_prepared dataset.
From the Jobs menu in the top navigation bar, go to the Scenarios page.
Open the Data Refresh scenario.
Click Run to manually trigger it.
Go to the Last Runs tab to confirm the scenario run failed.
Normally we’d want to fix a failing scenario before deploying the project into production. In this case though, the error will be instructive. Let’s move the project into production!
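For reference, a scenario can also be triggered programmatically with Dataiku’s Python client. The sketch below is an assumption-laden illustration, not part of the tutorial steps: the host URL, API key, and scenario ID are placeholders, and the exact shape of the run-info dictionary can vary by Dataiku version.

```python
def run_scenario(host: str, api_key: str, project_key: str, scenario_id: str) -> str:
    """Trigger a scenario run and return its outcome (e.g. 'SUCCESS' or 'FAILED').

    Requires the dataikuapi package and a reachable Dataiku instance;
    host and api_key are placeholders.
    """
    import dataikuapi  # deferred so the pure helper below works without the package

    client = dataikuapi.DSSClient(host, api_key)
    scenario = client.get_project(project_key).get_scenario(scenario_id)
    run = scenario.run_and_wait()  # blocks until the run finishes
    # Field names are an assumption; inspect the run info on your version.
    return run.get_info()["result"]["outcome"]

def run_failed(outcome: str) -> bool:
    """A scenario run is considered failed unless its outcome is SUCCESS."""
    return outcome.upper() != "SUCCESS"
```

On this project, a run of the Data Refresh scenario would come back failed until the data quality rule is fixed later in the tutorial.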
Create a bundle and publish it to the Deployer#
See a screencast covering similar steps to those required here.
The first step is to create a bundle from the project found in the Design node (the development environment).
From the More Options (…) menu in the top navigation bar, choose Bundles.
Click + New Bundle.
Add additional content to the bundle#
A bundle acts as a consistent packaging of a complete Flow. By default, it includes only the project metadata. As a result, all datasets will come empty, and models will come untrained. However, depending on the use case, we can choose to include additional content such as datasets, managed folders, and saved models.
Unlike in most real-life projects that would be connected to some kind of database or cloud storage, our initial datasets are uploaded files and managed folders. Therefore, they won’t be re-computed from production sources. To access these files in the production environment, we’ll also need to include them in the bundle.
Provide the bundle ID v1.
In the Additional Content section, to the right of Datasets, click Add. Choose cards.
To the right of Managed folders, click Add. Choose tx.
Click Create.
Important
Unlike tx and cards, the starting merchants dataset originates from a Download recipe, and hence Dataiku is able to recompute it in the production environment.
If the project included a saved model that we wanted to use for batch scoring (or real-time inference), we’d also need to include it in the bundle.
Publish the bundle to the Deployer#
The project on the Design node now includes a bundle. Although we could download this file and manually upload it to the Automation node, the strongly preferred method is to use the Project Deployer because it centralizes the history of all deployments.
From the Bundles page of the project on the Design node, select the v1 bundle.
Click Publish on Deployer, and then confirm your choice.
Instead of immediately opening the Deployer, click Done for now.
Tip
If you don’t have the ability to Publish on Deployer, return to the prerequisites for instructions on connecting your Design node to an Automation node.
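The same create-and-publish flow can be scripted with the Python client. This is a sketch under assumptions: host and API key are placeholders, and the bundle-ID constraint in the helper is a reasonable convention rather than a documented rule.

```python
def publish_new_bundle(host: str, api_key: str, project_key: str, bundle_id: str) -> None:
    """Create a bundle on the Design node and push it to the Project Deployer.

    Assumes dataikuapi is installed and the Design node is connected
    to a Deployer; host and api_key are placeholders.
    """
    import dataikuapi  # deferred so the helper below works without the package

    client = dataikuapi.DSSClient(host, api_key)
    project = client.get_project(project_key)
    project.export_bundle(bundle_id)   # same as + New Bundle in the UI
    project.publish_bundle(bundle_id)  # same as Publish on Deployer

def is_valid_bundle_id(bundle_id: str) -> bool:
    """IDs like 'v1' should be non-empty with no whitespace (assumed convention)."""
    return bool(bundle_id) and not any(ch.isspace() for ch in bundle_id)
```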
Create and manage deployments#
Until now, your experience with Dataiku may have been limited to the Design node. However, as an end-to-end platform, Dataiku includes other nodes (Automation, API, Govern) for production-oriented tasks.
Explore the Deployer#
Before actually deploying the bundle to the Automation node, let’s take a look at how the Deployer, and in particular for this tutorial, the Project Deployer, fits into this process.
First, there are two modes for installing the Deployer:
One is a local Deployer that sits on top of either the Design or Automation node and requires no further setup. Dataiku Cloud employs this option.
The other is a standalone or remote Deployer for infrastructures with multiple Design and/or Automation nodes.
Regardless of which setup is found on your instance, the process for using the Project Deployer is the same.
If you are using a remote Deployer, make sure you are connected to this instance. You’ll need credentials from your instance administrator.
For either setup, from the bundle details page on the Design node, click Deployer where the publishing timestamp is recorded.
Tip
You can also always navigate to the Deployer by choosing Local/Remote Deployer in the waffle menu from the top navigation bar.
Although we have pushed a bundle to the Deployer, we don’t yet have an actual deployment. Before creating a new deployment, take a moment to explore the Deployer. If you are using a remote Deployer, note the change in the instance URL.
From the Project Deployer, click on Deployer at the top left to see how this node has separate components for deploying projects, deploying API services, and monitoring.
Click on Deploying Projects to view current deployments, projects that have published bundles, and available infrastructures.
Tip
The Deployer for your instance might already have projects and/or API services belonging to colleagues and other teams!
Create a new deployment#
Thus far, we’ve published a bundle from the Design node to the Project Deployer. To create an active deployment, we still need to push the bundle from the Project Deployer to an Automation node.
If not already there, click Deployments in the top navigation bar of the Deployer to view all project deployments on the instance.
In the Bundles to deploy panel on the left of the Project Deployer, find the v1 bundle for this project, and click Deploy.
Caution
If the Deploy button is not clickable, it means there is no infrastructure ready for deployment. Dataiku Cloud users need to add the Automation node extension. Self-managed users (or their instance admins) should consult the reference documentation on Deployment infrastructures.
Choose a Target infrastructure. This will vary depending on the infrastructure available to your organization.
Leave the default Deployment ID, which takes the form <PROJECTKEY>-on-<infrastructure>.
Click Create.
Click Deploy and Activate.
Important
Just as when importing a project zip file to a Dataiku instance, you may see warnings about missing plugins or plugin version mismatches. If any of these plugins are used in the project at hand, you’ll want to closely review them. The same can be said for missing connections. See the article Concept | Automation node preparation for more details.
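The deployment steps above can also be sketched with the Python client. Method names follow the dataikuapi Project Deployer API as an assumption to verify on your version; host, API key, and infrastructure ID are placeholders.

```python
def default_deployment_id(project_key: str, infra_id: str) -> str:
    """Mirror the Deployer's default ID: <PROJECTKEY>-on-<infrastructure>."""
    return f"{project_key}-on-{infra_id}"

def create_deployment(host: str, api_key: str, project_key: str,
                      infra_id: str, bundle_id: str):
    """Create and activate a deployment from the Project Deployer.

    host/api_key are placeholders; check the create_deployment signature
    on your dataikuapi version.
    """
    import dataikuapi  # deferred so the helper above works without the package

    client = dataikuapi.DSSClient(host, api_key)
    deployer = client.get_projectdeployer()
    deployment = deployer.create_deployment(
        default_deployment_id(project_key, infra_id),
        project_key, infra_id, bundle_id)
    update = deployment.start_update()  # pushes the bundle to the Automation node
    update.wait_for_result()            # same as Deploy and Activate in the UI
    return deployment
```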
Manage deployment settings#
Your project is now running on the Automation node! You can click to open it from the panel on the left. Before doing so though, it is helpful to understand what deployment settings can be controlled from the Project Deployer itself.
Within the Project Deployer, we can monitor the status of each deployment, such as when it was created (and by whom), when it was updated (and by whom), and its recent scenario runs.
Note
The reference documentation also covers how to modify deployment settings from the Project Deployer.
Remap connections#
In the Settings tab of a deployment, you can configure criteria around variables, connections, code environments, and scenarios.
Connection remapping, for example, is one setting that will commonly need to be configured. Organizations often maintain different databases for development and production environments. If this is the case, you’ll need to remap the source connections used on the Design node to the target connections that should be used on the Automation node.
Within the new deployment, navigate to the Settings tab.
Navigate to the Connections panel on the left.
Tip
Assuming you are using the same data sources for development and production environments for this tutorial, no action is required here.
Manage scenario auto-triggers#
To ensure scenarios never run unexpectedly, all scenarios in a new deployment are deactivated by default — regardless of their settings on the Design node.
Remaining within the Settings tab of the deployment, navigate to the Scenarios panel.
Here you can enable, disable, or even override the behavior defined on the Automation node, giving you control over how scenarios are managed for a deployed project.
Leave the default setting (marked by a hyphen) that does not override the behavior defined on the Automation node.
Activate the scenario on the Automation node project#
Finally, let’s check out the project on the Automation node.
Ensure you are connected to a running Automation node.
Navigate back to the Status tab of the deployment.
Click to open the project on the Automation node.
Tip
Keep your Design, Deployer, and Automation nodes open in separate browser tabs!
Once on the Automation node, the project should look quite familiar. Confirm a few points:
The project homepage reports what bundle is running and when it was activated.
The scenario is not yet active.
Let’s activate the scenario in the Automation node project!
While in the Automation node version of the project, open the Data Refresh scenario.
In the Settings tab, make sure that the time-based trigger (every 1 minute) and the scenario’s auto-trigger setting are both enabled.
Click Save to activate it.
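Programmatically, the same switch can be flipped on the Automation node project. This is a sketch: host and API key are placeholders, and the `active` attribute on the scenario settings object is an assumption to verify against your dataikuapi version. As the steps above note, a scenario only runs automatically when both its trigger and its auto-trigger setting are enabled.

```python
def set_scenario_active(host: str, api_key: str, project_key: str,
                        scenario_id: str, active: bool = True) -> None:
    """Toggle a scenario's auto-trigger setting on the Automation node.

    host/api_key are placeholders; the `active` attribute is assumed.
    """
    import dataikuapi  # deferred so the helper below works without the package

    client = dataikuapi.DSSClient(host, api_key)
    scenario = client.get_project(project_key).get_scenario(scenario_id)
    settings = scenario.get_settings()
    settings.active = active  # same as the auto-trigger switch in the UI
    settings.save()

def auto_trigger_ready(scenario_active: bool, trigger_enabled: bool) -> bool:
    """A scenario runs on schedule only if both switches are on."""
    return scenario_active and trigger_enabled
```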
Monitor a batch deployed project#
Now that we have an active scenario running in production, we need to monitor it. In addition to the scenario’s Last Runs tab, let’s examine the other monitoring tools at our disposal.
At the project-level#
The most granular view is within the project itself.
From the Jobs menu of the Automation node project, select Automation Monitoring.
Click Load Scenario Runs.
Explore the Daily Summary and Timeline tabs using the filters to view the progress of any scenarios within this project.
At the Project Deployer-level#
Instead of checking the progress on the Automation node, we can also monitor the progress of deployments from the Project Deployer.
Return to the Project Deployer.
From the Deployments tab, we can view information about the status of any deployment, including scenario runs.
Select the current deployment to see an overview in the Status tab.
Unified Monitoring#
It can also be helpful to have an instance-wide view of the health status of all deployed projects (as well as API services).
From the Project Deployer, click Deployer at the top left to go to the Deployer home.
Select Monitoring.
Navigate to the Dataiku Projects tab, since projects are our interest at the moment.
See that the project deployment has an OK deployment status, but an error for its execution and data statuses. This registers one project deployment in the error category in terms of global status.
Tip
Depending on how quickly you’ve made it through these steps, your Unified Monitoring screen may not yet show an error. The default synchronization interval is five minutes. Learn more in the reference documentation on Unified Monitoring.
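The roll-up logic described above (one status in error pushes the whole deployment into the error bucket) can be expressed as a small helper. This is an illustrative sketch of the behavior, not Dataiku’s actual implementation; the status strings mirror what the Unified Monitoring screen displays.

```python
def needs_attention(deployment_status: str, execution_status: str,
                    data_status: str) -> bool:
    """A deployment counts as 'error' in the global status if any of its
    individual statuses is in error (illustrative model of Unified Monitoring)."""
    statuses = (deployment_status, execution_status, data_status)
    return any(s.upper() == "ERROR" for s in statuses)
```

For the tutorial project at this point, the deployment status is OK but execution and data are in error, so the deployment lands in the error category.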
Since we can see the scenario failing, let’s deactivate the automated trigger from the Project Deployer.
From the Dataiku Projects tab of the Unified Monitoring page, click on the row for the failing deployment to see more information.
Click on the deployment to open it.
Navigate to the Settings tab.
Go to the Scenarios panel.
Click to Disable automatic triggers.
Click Save and Update, and then Confirm.
Tip
Check back on the homepage of the Automation project to confirm the scenario is no longer active.
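Disabling auto-triggers from the Deployer can likewise be scripted. This sketch makes several assumptions: host and API key are placeholders, and the raw settings field name is a guess to verify by inspecting `deployment.get_settings().get_raw()` on your instance.

```python
def disable_auto_triggers(host: str, api_key: str, deployment_id: str) -> None:
    """Disable all scenario auto-triggers for a deployment from the Project Deployer.

    host/api_key are placeholders; the raw field name below is an assumption.
    """
    import dataikuapi

    client = dataikuapi.DSSClient(host, api_key)
    deployment = client.get_projectdeployer().get_deployment(deployment_id)
    settings = deployment.get_settings()
    settings.get_raw()["disableAutomaticTriggers"] = True  # assumed field name
    settings.save()
    deployment.start_update().wait_for_result()  # same as Save and Update
```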
Version a deployed project#
As you monitor the health of deployments, you’ll need to deploy updated versions of projects over time — especially since this one is already failing!
Where to make changes to a project#
When it is necessary to make a change to a deployed project, it’s critical to make all such changes in the development environment (the Design node), and then push a new bundle to the production environment (the Automation node).
It may be tempting just to make a quick change to the project on the Automation node, but you should avoid this temptation, as the project in the production environment would no longer be synced with its counterpart in the development environment.
Consider a situation where you want to revert back to an earlier version of the project. If you’ve made changes in the Automation node, these changes will be lost. Accordingly, actual development should always happen in the Design node, and new versions of bundles should be pushed from there.
Fix the failing scenario#
Taking this advice, let’s return to the Design node project. We know the source of the problem is a data quality rule on the tx_prepared dataset. Let’s fix it so the scenario can succeed.
In the Design node project, open the tx_prepared dataset.
Navigate to the Data Quality tab.
Click Edit Rules.
Select the Record count rule.
Decrease the min to 5000 and turn Off the soft min.
Click Run Test to confirm it returns OK.
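Conceptually, the record count rule being edited is simple threshold logic. The helper below is an illustrative model (with the soft min turned off, only the hard minimum applies), not Dataiku’s internal implementation.

```python
def record_count_ok(n_records: int, min_records: int = 5000) -> bool:
    """Model of the Record count rule after the fix: pass if the dataset
    has at least min_records rows (soft min disabled)."""
    return n_records >= min_records
```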
Create a second bundle#
With the scenario’s problem fixed, let’s update the existing deployment with a new bundle.
From the More Options (…) menu of the top navigation bar, click Bundles.
Click + New Bundle.
Name it v2.
In the release notes, add Fixed data quality rule.
Click Create.
Note
Note how when creating the second bundle, the configuration of the previous one is inherited. In this case, the uploaded dataset and the managed folder are already included.
Deploy the new bundle#
The process for deploying the new bundle is the same as for the first one.
Click on the newly-created v2 bundle, and click Publish on Deployer.
Confirm that you indeed want to Publish on Deployer.
Click to Open in Deployer to view the bundle details on the Deployer.
Once on the Deployer, click Deploy on the v2 bundle.
Dataiku gives the option to create a new deployment or update the existing one.
Since this is a new version of an existing deployment, verify Update is selected, and click OK.
Click OK again to confirm the deployment you want to edit.
We’re not done yet!
Navigate to the Status tab of the deployment, and note how Dataiku warns that the active bundle on the Automation node does not match the configured bundle.
Click the green Update button to deploy the new bundle. Then Confirm.
Navigate to the Deployments tab of the Project Deployer to see the new bundle as the currently deployed version of this project.
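Updating an existing deployment to a new bundle (or reverting to an old one, as shown later) can be scripted in the same way. Assumptions again: host and API key are placeholders, and the raw `bundleId` field name should be verified against your dataikuapi version. The v1/v2 naming is just a convention this tutorial uses.

```python
def update_deployment_bundle(host: str, api_key: str,
                             deployment_id: str, bundle_id: str) -> None:
    """Point an existing deployment at a different bundle and redeploy.

    host/api_key are placeholders; the raw 'bundleId' field is an assumption.
    """
    import dataikuapi  # deferred so the helper below works without the package

    client = dataikuapi.DSSClient(host, api_key)
    deployment = client.get_projectdeployer().get_deployment(deployment_id)
    settings = deployment.get_settings()
    settings.get_raw()["bundleId"] = bundle_id  # e.g. "v2", or "v1" to revert
    settings.save()
    deployment.start_update().wait_for_result()  # same as the green Update button

def next_bundle_id(bundle_id: str) -> str:
    """Increment IDs of the form v<N>: 'v1' -> 'v2' (a convention, not a requirement)."""
    assert bundle_id.startswith("v") and bundle_id[1:].isdigit()
    return f"v{int(bundle_id[1:]) + 1}"
```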
Activate a scenario from the Deployer#
Previously we activated the scenario directly from the Automation node project. Now let’s control it from the Project Deployer.
Open the deployment, and navigate to the Settings tab.
Go to the Scenarios panel.
Uncheck the box for Disable automatic triggers.
Click Activate All to enable the auto-triggers for all scenarios.
Click Save and Update and then Confirm.
Tip
Once you’ve done this, verify that the v2 scenario runs successfully! You can check this on the Automation project, the Project Deployer, or on the Unified Monitoring page (keeping in mind the synchronization interval).
Revert to a previous bundle#
It’s also important to be able to revert to an earlier version, should a newer bundle not work as expected. Let’s demonstrate that now.
From the Deployments tab of the Deployer, find the project in the left hand panel.
Click Deploy next to the v1 bundle.
With Update selected, click OK, and confirm this is correct.
Now on the Settings tab with v1 as the source bundle, click the green Update button, and Confirm the change.
Important
If you return to the Status tab of this deployment, or open the project homepage on the Automation node, you’ll see that v1 is once again the active bundle running in production.
Before signing off, be sure to disable automatic triggers for this deployment either from the Project Deployer or the Automation project!
See also
See the reference documentation to learn more about reverting bundles.
What’s next?#
Congratulations! To recap, in this tutorial, you:
Created a project bundle on the Design node.
Published a bundle to the Automation node via the Deployer.
Activated (and disabled) a scenario to run on the Automation node.
Saw where to monitor the health of deployments.
Switched bundle versions within a deployment.
Now that you have seen the batch processing framework for production, your next step may be to examine the real-time API scoring method of production presented in the API Deployment course.
See also
For more information on batch processing, please refer to the reference documentation on Production deployments and bundles.